A “Multivendor Post” to help our mutual NFS customers using VMware


We were more than a little surprised to see how popular our “Multivendor iSCSI” post was. The feedback was overwhelming and very supportive of industry leaders partnering to ensure customers’ success with VMware. While writing that post, we (Vaughn Stewart from NetApp and Chad Sakac from EMC) discussed following up the iSCSI post with one focused on deploying VMware over NFS. The most difficult part of creating this post is that we couldn’t do it with our iSCSI-focused colleagues.

Since the original post, we’ve been busy assisting our customers and partners. We apologize for the delay, so without further ado we present the follow-up: a “Multivendor NFS” post for our joint customers. One of the goals of this post is to dispel the FUD customers often hear around NFS. Heck, if EMC and NetApp can agree – then you KNOW this post is FUD-Free!

We would like to thank Stu Baker and Satyam Vaghani from VMware, along with numerous folks at EMC and NetApp for their input on this post.

While any NFSv3 server will work with VMware, and there are many NFS servers on the ESX HCL, there is a significant difference between what one can do with a generic NFS server and what one can do with an enterprise-class NFS storage array from EMC or NetApp. The reality is that only NetApp and EMC are supporting NFS deployments with VMware in significant volume.

Both of us personally are big supporters of NFS for VMware – but if you look at our post histories – we’re both also rational and try our best (we’re human, so sometimes we fail) to be balanced and neutral. We try to be good pragmatic voices, so our goal here is pragmatism and facts to help our mutual customers.

For more – read on…

Ok – let’s get a couple things off the table right off the bat:

1) “Is NFS a good, highly available, high-performance option for VMware that deserves equal consideration alongside the more traditional SAN choices?” – YES.

2) “Is NFS the be-all end-all storage protocol for VMware?” – NO.

Let’s break down the myth-busting and best practices into two areas: 1) Performance and Scaling; 2) High Availability.

1) Performance.

Often, people are dismissive of NFS performance. In our experience, this is rooted in the fact that NAS originated outside the datacenter (engineering/development), leveraging existing “cheap and dirty” (and effective!) LAN designs, and with poor-performance client-mode NFS clients running on what at the time were very limited CPU cycles.

This is pretty much the opposite of SANs, which originated in the datacenter, ran on “relatively expensive and lossless” (and effective!) SAN designs, and used high-performance hardware and kernel-mode drivers.

This argument reminds both of us of those who said IP would never be able to provide the quality required for telephone conversations. Never bet against Ethernet! This should be evident today with Cisco’s unified networking architecture. Consider:

1. NAS is widely deployed in the datacenter today.

2. It’s possible to build “bet the business” Ethernet infrastructure, including the lossless characteristics traditionally associated with Fibre Channel. This lossless behavior is exactly what is being delivered with Datacenter Ethernet: 10 Gbps throughput, very low latency, very low jitter, and lossless characteristics that match the fastest FC SANs.

3. NFS clients, like most iSCSI initiators, aren’t free. They cost CPU cycles; however, CPU cycles are cheap and readily available. In fact, the abundance of CPU cycles is what has enabled us to virtualize our servers, and this trend is accelerating. That said, with workloads where you are measuring every ESX host CPU cycle, or where workload density is gated on ESX host CPU cycles, a cost/benefit trade-off should be considered – just don’t base your thinking on CPU cycles alone.

Performance consideration #1: What Kind of Data Are You Serving?

We would like to suggest that an ESX server may require three types of storage, which we will label Physical, General Purpose Shared, and High Performance. We would like to share our view on these three types and the characteristics of each. Remember that the goal for our mutual customers is to virtualize all workloads, all applications, all use cases – and to do it in the simplest and most efficient way. Flexibility is paramount.

Physical Device Access

This is the easiest storage model to understand, as it is the most traditional. It is the storage model required by a physical ESX server in order for it to boot and run. This storage could be direct-attached, or it could be an FC, FCoE, or iSCSI LUN.

General Purpose Shared Storage Pools

As you know, virtual machines are composed of files which, for production functionality, must reside on shared storage. General-purpose VMs that are consolidated and stored in a shared storage pool may individually have moderate I/O requirements; however, their aggregate I/O load can be quite substantial.

VMware hit it out of the park; they developed VMFS, a clustered file system, which made it simple to have multiple ESX hosts simultaneously access a shared filesystem. Traditionally, clustered host filesystems are extremely complex.

In VMware ESX 3.x, VMware added support for NFS, which is natively a shared storage medium. With vSphere 4, NFS datastores support all the major VMware features at parity with VMFS. If one requires greater VM to datastore density, NFS can scale to general purpose VM densities equal to and often beyond what is possible with VMFS.

A key element of VMFS is its SCSI architecture, which includes a command queue limit – a cap on the number of commands that can be outstanding simultaneously against the LUN. In general, this LUN and HBA queue limit is what bounds VMFS scaling, as noted in this whitepaper.

While VMFS VM-to-datastore density can in theory match NFS on a VM-per-datastore scale, it requires advanced configurations that allow increased LUN queues. Examples of this include spanning VMFS volumes across multiple LUNs. These types of designs achieve the parallelism that exists by definition internally on an NFS server, which obscures all block-level queue management (LUN queues support the underlying filesystems in NAS devices, but generally there are many LUNs used for a single filesystem, and many queues – and this is all invisible as far as VMware is concerned).

Spanned VMFS volumes are, in essence, replicating what the NFS server makes simple (taking many block devices and creating a shared filesystem from them). With NFS datastores, the VMware NFS client simply logs into the NFS server – which handles all of the back-end I/O. The ability to have fewer, larger datastores benefits IT operations by reducing storage management operations – provisioning, replication, backup, etc. – for this general purpose shared pool use case.
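To make the queue-depth point concrete, here is a rough, illustrative sketch (the queue depth and per-VM figures are assumptions for illustration, not vendor guidance) of why the LUN queue gates VMFS density, and why spanning LUNs – or letting an NFS server manage many LUNs internally – relieves it:

```python
# Illustrative sketch only - the numbers below are assumptions, not vendor guidance.
lun_queue_depth = 32        # a common default HBA LUN queue depth
outstanding_io_per_vm = 4   # assumed average outstanding I/Os per busy VM

vms_per_lun = lun_queue_depth // outstanding_io_per_vm
print(f"~{vms_per_lun} busy VMs per LUN before commands start queuing")

# Spanning a VMFS volume across multiple LUNs multiplies the available queue slots -
# which is what an NFS server does internally, invisibly to ESX.
luns_backing_datastore = 4
print(f"~{vms_per_lun * luns_backing_datastore} busy VMs with "
      f"{luns_backing_datastore} LUNs behind the datastore")
```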

High Performance Datasets

As customers virtualize more and more servers, they eventually have to design configurations to address the resource requirements of more demanding applications. Examples of demanding applications include Microsoft SQL Server and Oracle Database Server.

These types of applications scale in two different dimensions: 1) unlike the general-purpose VMs, whose individual I/O workloads are light but whose aggregate I/O requirement is large, these are the reverse – a single VM has a large I/O requirement (whether IOps or MBps); 2) often the application best practices require its I/O workload to be isolated from the I/O workload of other systems. These design considerations are the same whether the application is deployed on a physical or a virtual server.

VMware, NetApp, and EMC all recommend that an application with high I/O requirements, or one which is sensitive to latency variation, get a storage design that focuses on that particular VM and is isolated from other datasets. Ideally, the data will reside on a VMDK stored on a datastore that is connected to multiple ESX servers, yet is only accessed by a single VM. The name of the game with these workloads isn’t scale in terms of VMs per datastore, but scaling the performance of one VM.

Also – certain specific use cases require very specific guest-level SCSI task management (aborts, resets, etc.). This is usually true of clustered apps (hence some of the existing RDM requirements). When virtualizing VMDKs, the ESX storage virtualization layer needs to map task management requests to primitives that are understood by the underlying layer. In the case of VMFS, this means mapping virtual task management requests to physical SCSI device task management requests, which is straightforward. In the case of NFS, there are no analogs to SCSI task management, so aborts and resets can only be processed on a best-effort basis (e.g., a command can only be aborted if it hasn’t been issued on the wire – once it is, there is no way for the host to convey a cancellation to the NFS server). Beyond the clustered use cases, these are exceedingly rare.

Performance consideration #2: Design a “Bet the Business” Ethernet Network

Can one run NFS datastores on any off-the-shelf GbE switches? Yes – but it’s not a good idea. Remember that you are designing a storage network that needs to have a performance/availability profile capable of supporting your VMware cluster, and that the aggregate availability of that cluster depends on it. With that in mind:

• Separate your IP storage and LAN network traffic on separate physical switches or be willing and able to logically isolate them using VLANs.

• Enable Flow-Control

• Enable spanning tree protocol with either RSTP or portfast enabled

• Filter / restrict bridge protocol data units on storage network ports

• Configure jumbo frames (always end-to-end – meaning in every device in all the possible IP storage network paths). Support for Jumbo Frames for NFS (and iSCSI) was added in VMware ESX 3.5U3 and later.

• Strongly consider using Cat6 cables rather than Cat5/5e. Can 1GbE work on Cat 5 cable? Yes. Are you building a “bet the business” Ethernet infrastructure? Remember that retransmissions will absolutely recover from errors – but have a more significant impact for these IP storage use cases than in general networking use cases.

• Ensure your Ethernet switches have the proper amount of port buffers, and other internals to properly support NFS (and iSCSI) traffic optimally

• While vSphere adds support for IPv6 for VM networks and VMkernel networks – IPv6 for VMkernel storage traffic is experimental at the initial vSphere release

• With NFS datastores – strongly consider switches that support cross-stack EtherChannel or Virtual Port Channel technologies. (This will become apparent in the HA section.)

• With NFS datastores – strongly consider 10GbE, or a simple upgrade path to 10GbE, as an important Ethernet switch feature.

Performance consideration #3: Think about throughput (MBps)

There are three primary measures of storage performance – bandwidth (MBps), throughput (IOps) and latency (ms). Throughput and bandwidth are related in the sense that the bandwidth needed is the throughput (IOps) multiplied by the I/O size. People sometimes confuse I/O size with filesystem allocation size (4K default in NTFS, 4K for WAFL, 8K for UxFS) – but they are unrelated. The I/O size is the size of the I/O operation from the host perspective.

IOps are usually gated by the backend configuration, where by “backend” we mean the array target. If the workload is cached, then it’s determined by the cache response (which is almost always astronomical), but most often it’s determined by the spindle configuration that supports the storage object. In the case of NFS datastores, the storage object is the filesystem. So, on a NetApp FAS, the IOps achieved are primarily determined by the number of disk drives in an Aggregate, and likewise on a Celerra they are primarily determined by the Automated Volume Manager configuration. Yes, there are other considerations (at a certain point, the FAS/Datamovers themselves, as well as the host’s ability to generate I/Os, become limits), but up to the limits most people actually run into – it’s the backend.

Ok – next thing to understand is that every NFS datastore mounted by ESX (including vSphere – though NetApp and EMC are both collaborating for longer term NFS client improvements in the vmkernel) uses two TCP sessions – one for NFS control information, and the other for NFS data flow itself.

[Image: Slide1.jpg – two TCP sessions per NFS datastore: one for control, one for data]

This means that the vast majority of the traffic to a single NFS datastore flows over a single TCP session. In turn, this means the traffic to that datastore – regardless of link aggregation or other methods – will use a single link, which caps the throughput achievable for a single datastore.

The key to this is understanding how Link Aggregation works. We strongly recommend going back and reading the section on “Understanding Link Aggregation” in the ESX/ESXi 3.5 iSCSI post – as it’s equally pertinent here. Seriously – go there now…

You back? Ok, now you understand why the NFS datastore dataflow being on one TCP session will result in a single link being used, no matter how it’s configured.
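To illustrate, here is a rough sketch of how an IP-hash style teaming choice is commonly described (this is illustrative only, not the exact ESX implementation): the link chosen is a function of the source and destination addresses of the flow, so a single VMkernel-to-NFS-server session always lands on the same uplink.

```python
# Rough illustration of an IP-hash style uplink selection (not the exact ESX algorithm).
def uplink_for_flow(src_ip_last_octet: int, dst_ip_last_octet: int, num_uplinks: int) -> int:
    return (src_ip_last_octet ^ dst_ip_last_octet) % num_uplinks

# One VMkernel IP talking to one NFS server IP hashes to the same uplink every time,
# so a single datastore is bounded by one link...
print(uplink_for_flow(10, 20, 2))  # always the same uplink for this flow
# ...while a datastore exported on a second NFS server IP can land on the other uplink.
print(uplink_for_flow(10, 21, 2))
```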

As we covered, if you are using GbE this means that a reasonable expectation is a unidirectional read or write workload of ~80-100MBps (GbE is full duplex – so this can be ~160MBps bidirectionally with a mixed workload).

Higher total throughput on an ESX server can be achieved by leveraging multiple datastores. You can scale up the total throughput across multiple datastores via link aggregation and routing mechanisms.

What types of virtual machine workloads are well suited to NFS? A shared datastore comprised of many VMs with an aggregate requirement within the guidelines above (this can be a large number of IOps, but generally lots of small-block I/O – not large-block I/O that needs more bandwidth than one GbE link can provide), or a single busy VM, as long as its I/O load can be served by a single GbE link.

Now, these performance parameters can be enough for MANY use cases – so don’t write it off.

With small block I/O (like 8K) – this is 12,500 IOPs – or put differently, roughly the performance of 70 15K spindles. But, on the other end, if you have a Sharepoint VM (or are doing a guest-level backup) – they tend to do IO sizes of 256K or larger. With 256K IO sizes, that’s 390 IOPs – or the performance of roughly 2 15K spindles – and likely not enough.
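For those who want to check the arithmetic, here is a quick sketch (assuming the ~100MBps of usable unidirectional GbE bandwidth discussed above, and decimal units):

```python
# Convert usable link bandwidth into achievable IOPS at a given I/O size (decimal units).
def iops_at_link_speed(link_mbps: float, io_size_kb: float) -> float:
    """IOPS = usable bandwidth (MB/s) / I/O size (MB)."""
    return (link_mbps * 1000.0) / io_size_kb

usable_gbe_mbps = 100.0  # ~80-100MBps usable on a single GbE link, per the text above

print(iops_at_link_speed(usable_gbe_mbps, 8))    # ~12,500 IOPS with 8K I/O
print(iops_at_link_speed(usable_gbe_mbps, 256))  # ~390 IOPS with 256K I/O
```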

Another option is 10GbE.

If you use 10GbE – though a single TCP session will still be used per datastore, there is much more throughput available for your most demanding workloads. We’d add that if you have 10GbE you probably also have access to FCoE and iSCSI, and this flexibility may be required to support some of those workloads.

If 10GbE isn’t an option – you can always use NFS for some VMs and FC for others.

So – what do the economics look like?

While 10GbE prices per port are higher today than 4Gbps FC, 10GbE prices are starting to drop rapidly, and we expect it to continue to drop through 2009, and this trend will be accelerated as 10GbE LoM (LAN on Motherboard) starts to become more prevalent. Also – from a TCO (acquisition, cabling, power, space, etc) standpoint, 10GbE Datacenter Ethernet like the Cisco Nexus 5000 series is comparable to separate 1Gbps Ethernet and 8Gbps FC today.

If you’re looking at FC and NFS together – take a good look at the second-generation FCoE converged network adapters and the Cisco Nexus 5K. FCoE configurations are supported by VMware, Cisco, EMC e-Lab and NetApp – so while these are early days, customers can begin to evaluate in earnest.

So – how many datastores? There is no hard and fast rule here – but for peak performance, increase the maximum number of NFS datastores using the ESX advanced settings shown here from the default of 8 to a higher number (this is a vSphere screenshot, but the same advanced property is available in ESX 3.5 – the only difference is that in vSphere the maximum is 64, versus 32 in VI3.5). When you increase the NFS datastore count, also increase the heap memory assigned to and available to the networking stack (including the NFS client) – and do this across all ESX hosts:

• Increase Net.TcpIpHeapSize to 30. This immediately increases the heap memory to 30MB.

• Increase Net.TcpIpHeapMax to 120. This increases the maximum heap memory to 120MB.

[Image: cfg1.jpg – ESX advanced settings screenshot]

With EMC Celerra there are a couple other important NFS related settings:

• On the Celerra filesystem supporting the NFS export:

o Enable the uncached write mechanism for all file systems (30%+ improvement)

[Image: Cellera1.jpg – Celerra uncached write setting]

o Disable the prefetch read mechanism for file systems consisting of VMs with small random accesses patterns

[Image: Cellera2.jpg – Celerra prefetch read setting]

Performance consideration #4: Plan your NFS server design accordingly

In general, consider both the performance and capacity axes – you need to design to meet capacity requirements (TB) and performance requirements (MBps, IOps, latency). You should employ every efficiency method you can, but you need to make sure that you plan to have enough spindles behind the filesystem supporting the NFS export to serve the aggregate IOps workload needed by all the VMs in the datastore. This isn’t hard to estimate – just measure a representative host (or hosts) using perfmon, top, or the VMware Capacity Planner. It is also easy to fix if you have enough backend spindles – expand the filesystem (simple on both NetApp and EMC Celerra) – and in vSphere, Storage VMotion is supported with NFS datastores as sources or targets, so you can re-balance datastores as needed.
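As a back-of-the-envelope illustration of the spindle math (the per-spindle figure is an assumption for illustration; RAID write penalties and cache benefits are ignored):

```python
import math

def spindles_needed(aggregate_iops: float, iops_per_spindle: float = 180.0) -> int:
    """Rough count of disks needed behind the filesystem to sustain a given IOps load."""
    return math.ceil(aggregate_iops / iops_per_spindle)

# e.g. 60 general-purpose VMs averaging 50 IOps each in one NFS datastore
print(spindles_needed(60 * 50))  # ~17 15K spindles, before RAID/cache effects
```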

2) High Availability

NFS uses a different model for HA design than native block devices – but you can absolutely create high-availability configurations.

HA consideration #1: Network and NFS server design

The first core difference is that block protocols (iSCSI/FC/FCoE) use an initiator-to-target multipathing model based on MPIO. The domain of the path choice is from the initiator to the target. For NAS, the domain of link selection is from one Ethernet MAC to another Ethernet MAC – one link hop. This is configured host-to-switch, switch-to-host, NFS server-to-switch and switch-to-NFS server, and the comparison is shown below (note that we called it “link aggregation”, but more accurately this is either static NIC teaming or dynamic LACP):

[Image: NFSSlide2.jpg – block MPIO vs. NAS link-selection comparison]

The mechanisms used to select one link or another are fundamentally:

• A link aggregation choice – which is set up per TCP connection – and is either static (set up once and permanent for the duration of the TCP session) or dynamic (can be renegotiated while maintaining the TCP connection – but still always on one link or another)

• A TCP/IP routing choice – where an IP address (and the associated link) is selected based on a layer-3 routing choice.

Note: Out of the box ESX/ESXi does not support dynamic LACP; however, Cisco’s 1000V vDS does provide this functionality along with numerous other enhancements which could take another blog post to discuss.

Here’s the basic decision tree:

[Image: NFSSlide5.jpg – NFS HA network design decision tree]

 

The path on the left has a topology that looks like this (note that the little arrows mean that you must configure the link aggregation/static teaming from the ESX host to the switch and on the switch to the ESX host, and the same “setup on both sides” for the switch-NFS server relationship):

[Image: NFSSlide3.jpg – link aggregation/static teaming topology]

The path on the right has a topology that looks like this (you can use link aggregation/teaming on the links – remembering that it won’t help with a single datastore – but routing is the selection mechanism):

[Image: NFSSlide4.jpg – routed (multiple subnet) topology]

 

HA consideration #2: NFS Client Timeout considerations

NAS device failover generally takes longer than native block device failover: block devices generally fail over after a “front end” failure in seconds (or milliseconds), while NAS devices tend to fail over in tens of seconds (it can be longer depending on the NAS device and the configuration specifics). This often gets thrown around by “block-heads” (the equivalent of a “NAS-bigot” – both types are equally dangerous 🙂 ) to instill FUD. The question is how much time elapses before ESX does something about it, and what the guest behavior is during that time period.

First – the same timeout concept exists with block storage, but the failover time is extremely rapid in almost all cases. Failed-path detection generally occurs as soon as the first I/O fails for Fibre Channel and FCoE, and the actual path change occurs as soon as one of the SCSI responses that signal a dead path is received (NOT_READY, ILLEGAL_REQUEST, NO_CONNECT and SP_HUNG for MRU arrays, or NO_CONNECT for Fixed arrays). These time periods are configurable (steps vary by HBA), but the defaults are good in almost all cases, are within the common guest OS timeout values, and are measured in low seconds. In vSphere, this behavior is controlled by the Path Selection Plugin (and path state is handled by the Storage Array Type Plugin); third-party multipathing plugins can further optimize it. Second – ESX and guest timeouts can be extended to survive reasonable FAS/Datamover failover intervals.

Third – use cases have varying tolerances for this behavior – some are perfectly fine with long timeouts, requiring no changes. Others are more sensitive.

Both NetApp FAS and EMC Celerra recommend the same ESX failover timeout settings. We recommend increasing the default values to avoid VMs being disconnected during a FAS/Datamover failover event.

[Image: cfg2.jpg – ESX NFS heartbeat advanced settings]

 

The settings both EMC and NetApp recommend (apply these across all ESX hosts):

NFS.HeartbeatFrequency (NFS.HeartbeatDelta in vSphere) = 12

NFS.HeartbeatTimeout = 5

NFS.HeartbeatMaxFailures = 10

The way these work:

1. Every “NFS.HeartbeatFrequency” (or 12 seconds) the ESX server checks to see that the NFS datastore is reachable.

2. Those heartbeats expire after “NFS.HeartbeatTimeout” (or 5 seconds), after which another heartbeat is sent.

3. If “NFS.HeartbeatMaxFailures” (or 10) heartbeats fail in a row, the datastore is marked as unavailable and the VMs “crash”.

This means that the NFS datastore can be unreachable for a maximum of 125 seconds before being marked unavailable, which covers the large majority of both NetApp FAS and EMC Celerra failover events.
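The 125-second figure falls out of the settings above; a quick check of the arithmetic:

```python
# Worst-case window before an NFS datastore is marked unavailable,
# per the heartbeat model described above.
heartbeat_frequency = 12     # NFS.HeartbeatFrequency / NFS.HeartbeatDelta (seconds)
heartbeat_timeout = 5        # NFS.HeartbeatTimeout (seconds)
heartbeat_max_failures = 10  # NFS.HeartbeatMaxFailures

worst_case_seconds = heartbeat_max_failures * heartbeat_frequency + heartbeat_timeout
print(f"Datastore marked unavailable after ~{worst_case_seconds} seconds")  # 125
```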

Now – what does a guest see during this period? It sees a non-responsive SCSI disk on the vSCSI adapter. The disk timeout determines how long the guest OS will wait while the disk is non-responsive. To set the operating system timeout for Windows servers to match the 125-second maximum set for the datastore:

1. Back up your Windows registry.

2. Select Start>Run, type regedit.exe and click OK.

3. In the left‐panel hierarchy view, double‐click HKEY_LOCAL_MACHINE, then System, then CurrentControlSet, then Services, and then Disk.

4. Select the TimeOutValue and set the data value to 125 (decimal).
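If you prefer to script this rather than edit the registry by hand, here is a minimal sketch using Python’s standard winreg module (run it as Administrator inside the guest; adapt it to your own change-control process):

```python
# Set the Windows guest SCSI disk timeout to 125 seconds (run as Administrator).
import winreg

key = winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Services\Disk",
    0,
    winreg.KEY_SET_VALUE,
)
winreg.SetValueEx(key, "TimeOutValue", 0, winreg.REG_DWORD, 125)
winreg.CloseKey(key)
```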

Additional Recommended Reading

1) We would STRONGLY recommend reading a series of posts that the inimitable Scott Lowe has done on ESX networking, and start at his recap here

2) Also – prior to getting started, we recommend that all deployments review our documentation:

EMC Celerra: VMware ESX Server Using EMC Celerra Storage Systems – Solutions Guide

EMC Celerra: VMware ESX Server Optimization with EMC® Celerra® Performance Study – Technical Note P/N 300-006-724

NetApp: NetApp & VMware Virtual Infrastructure 3: Storage Best Practices

In conclusion – NFS is an absolutely legitimate storage model for VMware, with many advantages. It deserves consideration along with all the other storage options available. As with everything, success is determined not only by technological factors, but by design – and most importantly – the customer’s experience with various technologies and models. As unified networking and 10GbE become the norm, we expect to see customers deploy a mix of storage protocols, as each has its pros and cons.

I’d like to thank my friend, competitor and partner in the blogosphere for making this post happen. We hope you find this information helpful and more importantly useful in the design of your virtual data center.

Vaughn Stewart


6 Comments

  1. Apologize for the long comment, but I have some (lots) questions about this… I did read the Link Aggregation Fundamentals in the iSCSI post, but it mostly led to more questions. 🙂
    Let’s start with ESX and the Cross-Stack Etherchannel setup. From what I remember, when patching/updating a switch stack, all the individual switches are updated at once, meaning possible outages for the servers without an available alternate path. That has me hesitant to use them to enable NIC teaming in ESX when one purpose of multiple NICs is to improve redundancy. Also, I don’t think ESX supports multiple NIC teams in a single vSwitch. Meaning in one vSwitch, we can’t have 2 teamed NICs linked to one switch stack and another 2 linked to a second switch stack.
    That leads me to prefer the second HA connectivity diagram you have, with multiple vSwitches each with their own vmKernel, but in that case, what does the availability look like when there’s a network failure? Are all datastores accessible via all paths? i.e. If the link fails (or the switch is updated…) connected to the first NIC, can the datastores be accessed via the second? If so, how does that failover occur? What configuration/connectivity steps do you have to take to get that working?
    Also, you mention multiple IPs on the array in both HA diagrams. Is that one IP per NIC on the array? Per NFS vmKernel on the ESX hosts? Per datastore? Other? Is it correct to assume that all the IPs are associated with a single NIC team, and that the array allows that…?
    Thanks!

  2. Excellent post, really enjoyed reading it.
    How can I calculate the amount of bandwidth my virtual machines will require prior to virtualisation? For example, if the VM requires more bandwidth than 1GbE can offer then it’s a little late after you’ve virtualised the machine to find this out.
    I also noticed that the post from July 22, 2009 at 06:19 AM wasn’t answered. I’d be interested in a response.
    Thanks

