There is a problem that impacts our mutual customers running VMware ESXi 5 with NFS connectivity. While the details are not finalized, it appears the engineering teams at VMware and NetApp have identified an issue between the NFS client in the ESXi 5 stack and the NFS service in Data ONTAP that results in the two behaving badly under high I/O load when SIOC is not in use. This issue appears to affect only vSphere 5 releases (not vSphere 4 or VI3) and FAS arrays with fewer than four CPUs; it is thus seen across the FAS2000 series and lower-end systems in the FAS3000 series. I cannot state whether this issue may impact other NFS platforms like EMC Isilon, Celerra, and VNX.
Massive investments in engineering resources go into assuring the quality of product releases and joint solutions; inevitably something falls through the cracks, and this is one of those times. For those impacted by this issue, my apologies. The NetApp and VMware engineering teams have been working furiously to identify and resolve this issue. A fix has been released by NetApp engineering, and for those unable to upgrade their storage controllers, VMware engineering has published a pair of workarounds.
Clarifying The Issue:
An NFS datastore disconnect issue displays the following behaviors…
- NFS datastores are displayed as greyed out and unavailable in vCenter Server or the vSphere Client
- Virtual Machines (VMs) on these datastores may hang during these times
- NFS datastores often reappear after a few minutes, which allows VMs to return to normal operation
- This issue is most often seen after ESXi 5 is introduced into an environment
This issue is documented in VMware KB 2016122 and NetApp Bug 321428
The Fix:
NetApp customers can upgrade Data ONTAP to correct this issue. Versions 7.3.7P1D2 and 8.0.5 have been released, and the forthcoming 8.1.3 is expected soon. While Data ONTAP upgrades are non-disruptive, they should likely be scheduled for times of reduced I/O activity.
Note: Data ONTAP release families are defined as 7.3.x, 8.0.x, and 8.1.x, with each dot release introducing a new set of features and capabilities. To address a bug, NetApp support suggests applying the release within the family already installed on your array that contains the fix.
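If you are unsure which release family a given controller is running, a quick check before planning the upgrade helps; this is just a sketch assuming SSH access to a 7-Mode controller (the hostname and account below are placeholders):
# Query the installed Data ONTAP release on a controller ("netapp01" and "admin" are placeholders)
ssh admin@netapp01 version
# The output resembles: NetApp Release 7.3.7P1: ...
# Match the reported family (7.3.x, 8.0.x, or 8.1.x) to the fixed release listed above for that family.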
The Workarounds:
While a Data ONTAP upgrade is non-disruptive, some VMware administrators may prefer to address the issue immediately to ensure operations. For those interested, VMware has published the following workarounds:
Workaround Option #1 – Enable SIOC
For those with a vSphere Enterprise Plus license, enabling Storage I/O Control (SIOC) will eliminate this issue, as it manages the value of MaxQueueDepth.
Workaround Option #2 – Limit MaxQueueDepth
For those without a vSphere Enterprise Plus license, or those who have not enabled SIOC, setting a manual limit on MaxQueueDepth will prevent the disconnect issue from occurring.
For the step-by-step procedure on how to complete this process in the vSphere Client, the vSphere 5 Web Client, and on the command line, please see VMware KB 2016122.
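For reference, here is a rough sketch of the command-line path (the authoritative step-by-step procedure remains KB 2016122; see the KB for whether a host reboot is required for the new value to take effect):
# Set the NFS.MaxQueueDepth advanced option to 64 on an ESXi 5.x host
esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -i 64
# Verify the configured value
esxcli system settings advanced list -o "/NFS/MaxQueueDepth"
# The older esxcfg-advcfg syntax accomplishes the same:
# esxcfg-advcfg -s 64 /NFS/MaxQueueDepth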
Considerations of the Workarounds:
I would advise that these workarounds be implemented on a temporary basis and remain in place until the NetApp FAS array(s) have been upgraded, at which point the workarounds should be disabled.
The reason for this suggestion is that when one implements an I/O limit, such as a queue depth of 64 in place of the default of roughly 4.29 billion (4294967295), there is the potential of creating an artificial I/O bottleneck. vSphere is equipped to remedy such issues via data migration technologies like Storage DRS; however, please note that shuffling data has a negative impact on storage and networking resources for those using disk-based backups, data deduplication, and data replication with VMware.
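Once the array has been upgraded, removing the workaround is just as simple; a minimal sketch, assuming the esxcli method above was used to set the limit:
# Reset NFS.MaxQueueDepth to its default value (4294967295)
esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -d
# Or set the default value explicitly:
# esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -i 4294967295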
I will edit this post should additional information be made available.
I’d like to thank Cormac Hogan for helping raise awareness with his post.
Be advised that none of the current workarounds are sufficient, and they lower the performance of the storage. We have SIOC turned on and still get > 60 sec. hangs during high I/O.
Lars, Have you upgraded Data ONTAP? If so, has that addressed your issues?
Vaughn, we are running 8.1.2 7-Mode. Our local NetApp support apparently has no clue about the issue; they are advising us to optimize file layout, etc. Will have a chat with them ASAP 😉
Point them to the bugs in the post, should help 🙂
Just to be clear, it is a NetApp issue. iSCSI clients hang/disconnect during the hangs as well, and NetApp logs the event locally in /etc/messages.
Different issue…
@Vaughn strange, both within 3 seconds, ref NetApp /etc/messages log entries
Mon Jan 28 13:46:46 CET [XXvmware_vfiler@XX-netapp02b: NwkThd_00:warning]: NFS response to client 10.28.0.48 for volume 0x2997b33f(XXvmware) was slow, op was v3 setattr, 140 > 60 (in seconds)
Mon Jan 28 13:46:49 CET [XX-netapp02b:iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:XX-vmware01-5014b0da) sent Abort Task Set request, aborting all SCSI commands from this initiator on lun 0
Maybe a new issue altogether then, occurring on several vFilers here. Moving the VMs to 3PAR at the moment for some relief…
Vaughn, you mention 7.3.7P2 in your post. The latest version I can see on the NetApp site is 7.3.7P1.
Do you have any information about a 7.3.7P2 being released?
/E
I haven’t been able to verify, is the code available now?
It’s 7.3.7P1D2. To get to it, on the SW download page, scroll down to “To access a specific [drop down] version, enter it here:”. In the drop-down, select Data ONTAP and enter the version. The rest is straightforward.
Erik
I have edited the post – the code you seek is 7.3.7P1D2
Thanks for the heads up!
Looks like EMC got to make code changes to their benefit.
Pure speculation on your part.
I think it’s more likely that the default EMC best practice is to use SIOC, which avoids the issue while also capping I/O.
I was just told by NetApp support that this bug only applies to the FAS2020. Is that true?
I would expect this issue to be related to an incompatibility condition created by the ESXi 5 NFS client and Data ONTAP and not to a hardware platform.
With that said the NetApp support team are the experts.
I’m not so sure about his explanation and have asked for clarification. The VMware KB specifically says “This issue can occur on any NetApp filer, including the FAS 2000/3000 series.” Also, my host-side logs look pretty similar to what’s in the KB. The only difference is that when my datastores check out, they aren’t reconnecting. Which is problematic.
VMware are not the experts on NetApp hardware or how our code works. They based that statement on seeing the problem mostly on FAS2000s and a small number of FAS3140s and assumed that if it hits two families, it hits all of them. An internal NFS code review shows it is not possible to hit this on higher models with 4 or more CPUs. Two cases reporting this issue on higher models were triaged to other issues. (In other words, “This isn’t the bug you’re looking for.” 😉) Our review of the draft VMware KB was before this code review, or we would have told that to the author.
THX Peter
Thanks for the response (although it does put me back to square one on figuring out what happened). Is there a definitive list of which models ARE affected?
Bug notes in link above list the following models:
***************snip*****************
Some recent releases of various vendors’ NFS clients appear to pursue an aggressive retransmission strategy which may increase the chances that this problem could occur.
The following storage-controller models are known to be susceptible:
FAS250
FAS270
FAS940
FAS960
FAS2020
FAS2040
FAS2050
FAS3020
FAS3040
FAS3050
FAS3140
FAS3210
FAS6030
FAS6040
N3300
N3400
N3600
N3700
N5200
N5300
N5500
N6040
N6210
N7600
N7700
A so-called “partial fix” in some releases is not effective to prevent several of the modes of failure.
A related bug is 654196.
FYI: just to clarify why enabling SIOC may not be sufficient and does not always prevent the issue/hang:
1) If the filer is running other workloads, sharing NFS to non-SIOC clients, and/or
2) if/when the filer runs internal jobs like dedupe/snapshot cleanup,
then the following warning may appear in the vCenter alarm view:
“Unmanaged I/O workload detected on shared datastore running Storage I/O Control (SIOC)”
VMware then briefly disables SIOC (see below for why), and the filer may hang.
Throttling in this case would result in the external workload getting more and more bandwidth, while the vSphere workloads get less and less. Therefore, SIOC detects the presence of such external workloads, and as long as they are present while the threshold is being exceeded, SIOC competes with the interfering workload by reducing its usual throttling activity.
Hi,
We have a customer running 5.1 and ONTAP 8.1.1p1.
They have these connectivity issues most likely due to NFS queue depth.
Reading the KBs, we advised them to enable SIOC on the datastores.
This did not solve the issue; my question now is…
Customer wants to have SIOC enabled, BUT it seems that he also needs to set the NFS.MaxQueueDepth to 64 to avoid this issue.
What are the consequences of having SIOC enabled and NFS.MaxQueueDepth set to 64?
Will SIOC still compete for resources?
Is this not a recommended combination of settings?
BR
Linus
Linus,
My understanding is NetApp Customer Success Services (CSS) is planning to address this issue in the 8.1 code line beginning with 8.1.2p4.
The workarounds carry a potential downside of increased SDRS migrations (due to IO greater than the queue limit). As stated in the post, I’d implement the workarounds until the patch is available.
Vaughn, when will 8.1.2P4 be available??? The workarounds are not an option for our customer. New MetroCluster installation with vSphere 5.1. What are your suggestions?
Vaughn, when will 8.1.2P4, or better yet 8.1.3, be available??? Are we talking about weeks or months?
I’m experiencing similar disconnects in my environment, but our specs don’t fall under the acknowledged affected devices.
ESXi 5.0.0
10Gb
FAS3270
4 x Xeon E5240
Version 8.1 7-Mode
Can anyone from NetApp comment?
Finally a solution! This has been a thorn in my side for over a year!
My take on the FAS3140 and Data ONTAP 8.1.1:
We have vSphere 4.1 and Data ONTAP 8.1.1 running on Cisco UCS and a FAS3140.
When we delete a VM snapshot, the whole NFS datastore gets dismounted and remounted. How cool is that?
After 3 months of troubleshooting day and night with NetApp support, they pointed me to this bug: inefficient pre-fetching of metadata blocks delays the WAFL Consistency Point.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=393877
We want to overcome the performance issue, so we looked at Flash Cache cards and Flash Pool:
Flash Pool is not supported.
Flash Cache cards can only work at half capacity.
It all points back to the FAS3140 not being able to cope with what Data ONTAP 8.1.1 does on the system.
On 7.3.5 we didn’t have these limitations.
The FAS3140 is a lower-end model with only 2 CPUs and 4 GB of memory.
NetApp wants us to upgrade to a higher-end model.
The issue: we still have two years left on our leasing agreement, and due to government funding cuts to emergency services, our options are very limited.
What explanation do you have for us, NetApp?!
Data ONTAP 8.x is not meant for the FAS3140.
Jude,
Support issues can be frustrating, especially when resolution is difficult to achieve. The behavior you are describing sounds odd. The native VMware snapshot function is based on SCSI transaction logs and can exhibit a number of nuances / unexpected behaviors; what you have described is not one of them.
I’d like to know more. Would you be willing to send me your case number?
Jude,
I sent this to you privately but am reposting here for the community…
I’d clean up the alignment or get rid of the linked clones if that’s what they’re using.
Also verify the settings for BURT 90314 are in effect, since a VMware snap delete is a filer WAFL delete.
It’s a 3140 in 7-mode
The actual number of NFS ops is pretty low based upon – https://latx/viewperf/id/149681
Elpsd CP CP CP CP CP CP CP CP CP CP CP CP CP CP CP
Time Total Blkd Nrml Timer Snap Bufs Drty Full B2B Flsh Sync vbufs dB2B dVecs mSecs
( 293)01/16/013 22:01:02 35 0 35 5 0 0 0 30 0 0 0 0 0 0 102226
( 293)01/16/013 22:08:04 53 1 52 0 0 0 0 48 1 0 4 0 0 0 132543
( 294)01/16/013 22:14:55 106 2 104 0 0 0 0 104 2 0 0 0 0 0 157300
( 293)01/16/013 22:22:20 40 1 39 3 0 0 4 29 1 0 3 0 0 0 120899
( 293)01/16/013 22:29:29 31 0 31 13 0 0 0 18 0 0 0 0 0 0 86016
NFS TCP input flow control is engaged, so the read latency is getting hit:
tcp input flowcontrol receive=30808, xmit=0
tcp input flowcontrol out, receive=30808, xmit=0
They do have alignment issues: look at all the delta files…
And that’s a small subset…
The LUN stats look fine though, so it’s just NFS…
Files Causing Misaligned I/Os
[Counter=38505], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/NCS01/NCS01-flat.vmdk
[Counter=89345], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01_1-000002-delta.vmdk
[Counter=12], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01-000002-delta.vmdk
[Counter=43], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASISD01/CFASISD01-000003-delta.vmdk
[Counter=631], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000003-delta.vmdk
[Counter=4], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_2-000001-delta.vmdk
[Counter=6], Filename=DC1_PROD_VMDK_SAS_03/VMDK_03/CFAEVLT01/CFAEVLT01-flat.vmdk
[Counter=94], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000001-delta.vmdk
[Counter=12], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAIDNS01/CFAIDNS01-flat.vmdk
[Counter=1474], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAMSPA03/CFAMSPA03-000001-delta.vmdk
[Counter=1055], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_1-000001-delta.vmdk
[Counter=75], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/CFANOPM01/CFANOPM01-000004-delta.vmdk
[Counter=2], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAIDNS01/CFAIDNS01-flat.vmdk
[Counter=19], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/CFANOPM01/CFANOPM01-000004-delta.vmdk
[Counter=12121], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01_1-000002-delta.vmdk
[Counter=255], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAMSPA03/CFAMSPA03-000001-delta.vmdk
[Counter=35], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000003-delta.vmdk
[Counter=4688], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/NCS01/NCS01-flat.vmdk
[Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAHPRO02/CFAHPRO02-000003-delta.vmdk
[Counter=131], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_1-000001-delta.vmdk
[Counter=5], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/ilm1/ilm1-000001-delta.vmdk
[Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAUCMD01/CFAUCMD01-000002-delta.vmdk
[Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFACCAM01/CFACCAM01_2-000001-delta.vmdk
[Counter=26], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000001-delta.vmdk
Thanks to the brilliance of John Ferry, NetApp Critical Problem Resolution Escalations Engineer, for the support shared above
Had this issue with FAS2020. After going to 7.3.7P1D2 all problems gone. Thanks!!
I am having this issue with a FAS2020A right now. It seemed to pop up for the first time about 2–3 weeks following our upgrade to ESXi 5.1. The workaround of changing NFS.MaxQueueDepth to 64 had no positive result for us. Currently in the process of upgrading the filer to 7.3.7P1D2. Will advise with results.
Heath – thanks, looking forward to your results
It’s been running 8 hours with no NFS drops. Looks good so far. Will update again tomorrow.
Been running 7.3.7P1D2 on the FAS2020A for 48 hours now. Completely cleared up the disconnects. No negative side effects. Thanks all!
Awesome!
Just ran into this issue today while building out a VMware cluster for Oracle. Vaughn – do you know if this issue has been resolved in any of the 8.1.2P* releases?
Thanks,
Dan
Are you sure a similar problem does not exist on the FAS6200 series? We were seeing NFS disconnects/reconnects prior to enabling Storage I/O Control.
We encountered NetApp bug 393877 and confirmed that 8.1.2P3 resolved the issue. http://techfailures.wordpress.com/2013/03/28/netapp-fail-2/
ONTAP release 8.1.2P4 was released today, and bug 321428 has been marked as fixed in it.
Looks like we have run into the same issue about 4 weeks after doing an upgrade to ESXi 5.1. We are running FAS2040s with ONTAP 7.2.2P7. We are contacting NetApp now so they can confirm; then we will attempt to upgrade our filers.
Running 5.0U2 and 8.1.2P4 and still experiencing the issue. I am deploying the queue depth change today, fingers crossed.
We just upgraded to 8.1.3P1 and still experience this issue.
Hi Norbert.
We are even running 8.1.3P2 and the issue is still here. Any progress in the meantime?
Thanks for sharing.
Bojan
What is the fix for a VNX5200 that has been updated to the latest code for file and block?
I am experiencing this issue right now. The NFS mount to a VNX5200 keeps going grey and (inactive). VMs seem to be pingable, but I cannot manage them in the vSphere Client (Unable to connect to the MKS). A ticket has been opened with VMware.
Other hosts still show as connected to this datastore.
esxcfg-nas -r is not reconnecting it either. That network is pingable.
The fix for my situation was to upgrade the firmware in the QLogic 10GbE NIC cards and then, when that was complete, upgrade the ESXi QLogic driver (the new driver needed the new firmware to be stable). Knock on wood, but a few weeks have gone by and we’re still running 100% stable.
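For anyone chasing a similar NIC-level cause, a quick sketch of how to confirm which driver and firmware an ESXi 5.x host is actually running (vmnic0 is just an example uplink):
# Show driver name, driver version, and firmware version for one uplink
esxcli network nic get -n vmnic0
# List all uplinks first to locate the QLogic 10GbE ports:
# esxcli network nic list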
Hi,
Do you know if the 7.3.7P2 update contains the fix from 7.3.7P1D2?
I have an N3600 (rebranded FAS2050) and I only seem to be able to get 7.3.7P2, and a previous comment makes me think I really need 7.3.7P1D2.
Cheers
Aftab
Ignore me – I just reread the KB and will get P2 downloaded and installed.
Cheers