Heads up! Avoiding VMware vSphere ESXi 5 NFS Disconnect Issues


There is a problem that impacts our mutual customers running VMware ESXi 5 with NFS connectivity. While details are not finalized, it appears the engineering teams at VMware and NetApp have identified an issue between the NFS client in the ESXi 5 stack and the NFS service in Data ONTAP that results in the two behaving badly under high I/O load when SIOC is not in use. This issue appears to affect only vSphere 5 releases (not vSphere 4 or VI3) and only lower-end FAS arrays with fewer CPUs; thus it is seen across the FAS2000 series and lower-end systems in the FAS3000 series. I cannot state whether this issue may impact other NFS platforms like EMC Isilon, Celerra & VNX.

Massive investments in engineering resources go into assuring the quality of product releases and joint solutions; inevitably something falls through the cracks, and this is one of those times. For those impacted by this issue, my apologies. The NetApp and VMware engineering teams have been working furiously to identify and resolve this issue. A fix has been released by NetApp engineering, and for those unable to upgrade their storage controllers, VMware engineering has published a pair of workarounds.

 

Clarifying The Issue:

The NFS datastore disconnect issue displays the following behaviors:

  • NFS datastores are displayed as greyed out and unavailable in vCenter Server or the vSphere Client
  • Virtual machines (VMs) on these datastores may hang during these periods
  • NFS datastores often reappear after a few minutes, which allows VMs to return to normal operation
  • This issue is most often seen after ESXi 5 is introduced into an environment

This issue is documented in VMware KB 2016122 and NetApp Bug 321428.
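If you want to confirm you are seeing this specific behavior rather than a general networking problem, the ESXi host logs are the quickest check. Below is a minimal sketch, assuming SSH or ESXi Shell access to an ESXi 5.x host; the exact log message text varies by build, so treat VMware KB 2016122 as the authoritative reference for the log signatures.

    # On an affected ESXi 5.x host (SSH / ESXi Shell):
    # look for NFS connection lost/restored events around the time of a disconnect.
    grep -i nfs /var/log/vmkernel.log | grep -iE "lost|restored"

    # List the NFS datastores and the state the host currently reports for them.
    esxcli storage nfs list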

 

The Fix:

NetApp customers can upgrade Data ONTAP to correct this issue. Versions 7.3.7P1D2 & 8.0.5 have been released, and the forthcoming 8.1.3 is expected soon. While Data ONTAP upgrades are non-disruptive, they should still be scheduled for times of reduced I/O activity.

Note: Data ONTAP release families are defined as 7.3.x, 8.0.x, and 8.1.x, with each dot release introducing a new set of features and capabilities. To address a bug, NetApp Support suggests applying the Data ONTAP version containing the fix from the release family already installed on your array.
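As a quick sanity check before and after the upgrade, the running release can be read straight from the controller. A minimal sketch, assuming SSH or console access to a 7-Mode controller:

    # On the NetApp controller (7-Mode), via SSH or the console:
    # prints the running release, e.g. "NetApp Release 8.0.x 7-Mode".
    version

    # On an HA pair, run it on both controllers to confirm each node is on the fixed release.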

 

The Workarounds:

While a Data ONTAP upgrade is non-disruptive, some VMware administrators may prefer to address the issue immediately to ensure continuity of operations. For those interested, VMware has published the following workarounds:

Workaround Option #1 – Enable SIOC

For those with a vSphere Enterprise Plus license, enabling Storage I/O Control (SIOC) will eliminate this issue, as it manages the value of NFS.MaxQueueDepth.

Workaround Option #2 – Limit MaxQueueDepth

For those without a vSphere Enterprise Plus license, or those who have not enabled SIOC, setting a manual limit on NFS.MaxQueueDepth will prevent the disconnect issue from occurring.

For the step-by-step procedure on how to complete this change in the vSphere Client, the vSphere 5 Web Client, and on the command line, please see VMware KB 2016122.
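For reference, the command-line path in the KB boils down to a single advanced setting per host. A rough sketch, run from the ESXi Shell or via SSH on each host mounting the affected datastores (the KB notes whether a host reboot is needed for your build):

    # Check the current per-host limit (the default is 4294967295).
    esxcli system settings advanced list -o /NFS/MaxQueueDepth

    # Apply the workaround value of 64 described in VMware KB 2016122.
    esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64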

 

Considerations of the Workarounds:

I would advise that these workarounds be implemented on a temporary basis and remain in place until the NetApp FAS array(s) have been upgraded, at which point the workarounds should be disabled.

The reason for this suggestion is that when one implements an I/O limit, such as a queue depth of 64 down from the default of roughly 4.29 billion (4294967295), there is the potential of creating an artificial I/O bottleneck. vSphere is equipped to remedy such issues via data migration technologies like Storage DRS; however, please note that shuffling data between datastores places additional load on storage and networking resources and reduces the efficiency of disk-based backups, data deduplication, and data replication in VMware environments.
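When the array is on a fixed Data ONTAP release and you are ready to remove the workaround, the same setting can be returned to its default; a minimal sketch, again per host:

    # Restore NFS.MaxQueueDepth to the ESXi default once the Data ONTAP fix is in place,
    # then verify the value.
    esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 4294967295
    esxcli system settings advanced list -o /NFS/MaxQueueDepth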

I will edit this post should additional information be made available.

I’d like to thank Cormac Hogan for helping raise awareness with his post.

Vaughn Stewart
http://twitter.com/vStewed
Vaughn is a VP of Systems Engineering at VAST Data. He helps organizations capitalize on what’s possible from VAST’s Universal Storage in a multitude of environments including A.I. & deep learning, data analytics, animation & VFX, media & broadcast, health & life sciences, data protection, etc. He spent 23 years in various leadership roles at Pure Storage and NetApp, and has been awarded a U.S. patent. Vaughn strives to simplify the technically complex and advocates thinking outside the box. You can find his perspective online at vaughnstewart.com and in print; he’s coauthored multiple books including “Virtualization Changes Everything: Storage Strategies for VMware vSphere & Cloud Computing”.

50 Comments

  1. Be advised that none of the current workarounds are sufficient, and that they lower the performance of the storage. We have SIOC turned on and still get > 60 sec. hangs during high I/O.

  2. Vaughn, we are running 8.1.2 7-Mode. Our local NetApp support apparently has no clue about the issue; they are advising us to optimize file layout etc. Will have a chat with them asap 😉

        • @Vaughn strange, both within 3 seconds, ref NetApp /etc/messages log entries

          Mon Jan 28 13:46:46 CET [XXvmware_vfiler@XX-netapp02b: NwkThd_00:warning]: NFS response to client 10.28.0.48 for volume 0x2997b33f(XXvmware) was slow, op was v3 setattr, 140 > 60 (in seconds)
          Mon Jan 28 13:46:49 CET [XX-netapp02b:iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:XX-vmware01-5014b0da) sent Abort Task Set request, aborting all SCSI commands from this initiator on lun 0

          Maybe altogether a new issue then, occurring on several vFilers here. Moving the VMs to 3PAR atm. for some relief….

  3. Vaughn, you mention 7.3.7P2 in your post. The latest version I can see on the NetApp site is 7.3.7P1.
    Do you have any information about a 7.3.7P2 being released?
    /E

    • I would expect this issue to be related to an incompatibility condition created by the ESXi 5 NFS client and Data ONTAP and not to a hardware platform.

      With that said, the NetApp support team are the experts.

      • I’m not so sure about his explanation and have asked for clarification. The VMware KB specifically says, “This issue can occur on any NetApp filer, including the FAS 2000/3000 series.” Also, my host-side logs look pretty similar to what’s in the KB. The only difference is that when my datastores check out, they aren’t reconnecting. Which is problematic.

        • VMware are not the experts on NetApp hardware or how our code works. They based that statement on seeing the problem mostly on FAS2000s and a small number of FAS3140s and assumed that if it hits two families, it hits all of them. Internal NFS code review shows it is not possible to hit this on higher models with 4 or more CPUs. Two cases reporting this issue on higher models were triaged to other issues. (In other words, “This isn’t the bug you’re looking for.” 😉) Our review of the draft VMware KB was before this code review, or we would have told that to the author.

          • Thanks for the response (although it does put me back to square one on figuring out what happened). Is there a definitive list of which models ARE affected?

          • Bug notes in link above list the following models:
            ***************snip*****************
            Some recent releases of various vendors’ NFS clients appear to pursue an
            aggressive retransmission strategy which may increase the chances that this
            problem could occur.

            The following storage-controller models are known to be susceptible:

            FAS250
            FAS270
            FAS940
            FAS960
            FAS2020
            FAS2040
            FAS2050
            FAS3020
            FAS3040
            FAS3050
            FAS3140
            FAS3210
            FAS6030
            FAS6040
            N3300
            N3400
            N3600
            N3700
            N5200
            N5300
            N5500
            N6040
            N6210
            N7600
            N7700

            A so-called “partial fix” in some releases is not effective to prevent
            several of the modes of failure.

            A related bug is 654196.

  4. FYI, just to clarify why enabling SIOC may not be sufficient / does not prevent the issue/hang:

    1) If the filer is running other workloads, sharing NFS to non-SIOC clients, and/or
    2) If/when the filer runs internal jobs like dedup/snapshot cleanup

    Then the following may kick in as a warning in the vCenter alarm view:
    “Unmanaged I/O workload detected on shared datastore running Storage I/O Control (SIOC)”

    VMware then briefly disables SIOC (see below for why), and the filer may hang.
    Throttling in this case would result in the external workload getting more and more bandwidth, while the vSphere workloads get less and less. Therefore, SIOC detects the presence of such external workloads, and as long as they are present while the threshold is being exceeded, SIOC competes with the interfering workload by reducing its usual throttling activity.

  5. Hi,

    We have a customer running 5.1 and Data ONTAP 8.1.1P1.
    They have these connectivity issues, most likely due to NFS queue depth.

    Reading the KBs, we advised them to enable SIOC on the datastores.
    This did not solve the issue, so my question now is:

    The customer wants to have SIOC enabled, BUT it seems that he also needs to set NFS.MaxQueueDepth to 64 to avoid this issue.
    What are the consequences of having SIOC enabled and NFS.MaxQueueDepth set to 64?
    Will SIOC still compete for resources?
    Is this not a recommended combination of settings?

    BR
    Linus

    • Linus,

      My understanding is NetApp Customer Success Services (CSS) is planning to address this issue in the 8.1 code line beginning with 8.1.2p4.

      The workarounds carry a potential downside of increased SDRS migrations (due to IO greater than the queue limit). As stated in the post, I’d implement the workarounds until the patch is available.

      • Vaughn, when will 8.1.2P4 be available??? The workarounds are not an option for our customer. New MetroCluster installation with vSphere 5.1. What are your suggestions?

  6. I’m experiencing similar disconnects in my environment, but our specs don’t fall under the acknowledged affected devices.

    ESXi 5.0.0
    10Gb
    FAS3270
    4 x Xeon E5240
    Version 8.1 7-Mode

    Can anyone from NetApp comment?

  7. My take on the FAS3140 and Data ONTAP 8.1.1

    We have vSphere 4.1 and Data ONTAP 8.1.1 running on Cisco UCS and a FAS3140.
    When we delete a VM snapshot, we had the whole NFS datastore dismounted and remounted. How cool is that?

    After 3 months of troubleshooting day and night with NetApp support, they pointed me to this: inefficient pre-fetching of metadata blocks delays the WAFL Consistency Point.

    http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=393877

    We want to overcome the performance issue. We want to get the Flash cards and Flash Pool.
    Flash Pool is not supported.
    The Flash cards can work at half capacity.

    It all points back to the FAS3140 not being able to cope with what Data ONTAP 8.1.1 does on the system.
    On 7.3.5 we didn’t have these limitations.

    The FAS3140 is a lower-end model with only 2 CPUs and 4GB of memory.

    NetApp wants us to upgrade to a higher-end model.

    The issue: we have a leasing agreement with 2 more years left, and due to government funding cuts to emergency services, we are in a state where the options are very limited.

    What explanation do you have for us, NetApp!!!!

    Data ONTAP 8.x is not meant for the FAS3140.

    • Jude,

      Support issues can be frustrating, especially when resolution is difficult to achieve. The behavior you are describing sounds odd. The native VMware snapshot function is based on SCSI transaction logs and can exhibit a number of nuances / unexpected behaviors; what you have described is not one of them.

      I’d like to know more. Would you be willing to send me your case number?

      • Jude,

        I sent this to you privately but am reposting here for the community…

        I’d clean up the alignment or get rid of the linked clones if that’s what they’re using.
        Also verify the settings for burt 90314 are in effect, since a VMware snap delete is a filer WAFL delete.

        It’s a 3140 in 7-Mode.
        The actual number of NFS ops is pretty low based upon – https://latx/viewperf/id/149681

        Elpsd CP CP CP CP CP CP CP CP CP CP CP CP CP CP CP
        Time Total Blkd Nrml Timer Snap Bufs Drty Full B2B Flsh Sync vbufs dB2B dVecs mSecs
        ( 293)01/16/013 22:01:02 35 0 35 5 0 0 0 30 0 0 0 0 0 0 102226
        ( 293)01/16/013 22:08:04 53 1 52 0 0 0 0 48 1 0 4 0 0 0 132543
        ( 294)01/16/013 22:14:55 106 2 104 0 0 0 0 104 2 0 0 0 0 0 157300
        ( 293)01/16/013 22:22:20 40 1 39 3 0 0 4 29 1 0 3 0 0 0 120899
        ( 293)01/16/013 22:29:29 31 0 31 13 0 0 0 18 0 0 0 0 0 0 86016

        NFS TCP input flow control is engaged, so the read latency is getting hit:

        tcp input flowcontrol receive=30808, xmit=0
        tcp input flowcontrol out, receive=30808, xmit=0

        They do have alignment issues: look at all the delta files…….
        And that’s a small subset.
        The LUN stats look fine though, so it’s just NFS…..

        Files Causing Misaligned IO’s
        [Counter=38505], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/NCS01/NCS01-flat.vmdk
        [Counter=89345], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01_1-000002-delta.vmdk
        [Counter=12], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01-000002-delta.vmdk
        [Counter=43], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASISD01/CFASISD01-000003-delta.vmdk
        [Counter=631], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000003-delta.vmdk
        [Counter=4], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_2-000001-delta.vmdk
        [Counter=6], Filename=DC1_PROD_VMDK_SAS_03/VMDK_03/CFAEVLT01/CFAEVLT01-flat.vmdk
        [Counter=94], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000001-delta.vmdk
        [Counter=12], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAIDNS01/CFAIDNS01-flat.vmdk
        [Counter=1474], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAMSPA03/CFAMSPA03-000001-delta.vmdk
        [Counter=1055], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_1-000001-delta.vmdk
        [Counter=75], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/CFANOPM01/CFANOPM01-000004-delta.vmdk
        [Counter=2], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAIDNS01/CFAIDNS01-flat.vmdk
        [Counter=19], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/CFANOPM01/CFANOPM01-000004-delta.vmdk
        [Counter=12121], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01_1-000002-delta.vmdk
        [Counter=255], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAMSPA03/CFAMSPA03-000001-delta.vmdk
        [Counter=35], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000003-delta.vmdk
        [Counter=4688], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/NCS01/NCS01-flat.vmdk
        [Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAHPRO02/CFAHPRO02-000003-delta.vmdk
        [Counter=131], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_1-000001-delta.vmdk
        [Counter=5], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/ilm1/ilm1-000001-delta.vmdk
        [Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAUCMD01/CFAUCMD01-000002-delta.vmdk
        [Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFACCAM01/CFACCAM01_2-000001-delta.vmdk
        [Counter=26], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000001-delta.vmdk

  8. I am having this issue with a FAS2020A right now. It seemed to pop up for the first time about 2-3 weeks following our upgrade to ESXi 5.1. The workaround of changing NFS.MaxQueueDepth to 64 had no positive result for us. We are currently in the process of upgrading the filer to 7.3.7P1D2. Will advise with results.

  9. Just ran into this issue today while building out a VMware cluster for Oracle. Vaughn – do you know if this issue has been resolved in any of the 8.1.2P* releases?

    Thanks,

    Dan

  10. Are you sure a similar problem does not exist on the FAS6200 series? We were seeing NFS disconnects/reconnects prior to enabling Storage I/O Control.

  11. Looks like we have run into the same issue about 4 weeks after upgrading to ESXi 5.1. We are running FAS2040s with Data ONTAP 7.2.2P7. We are contacting NetApp now so they can confirm; then we will attempt to upgrade our filers.

  12. Running 5.0 U2 and 8.1.2P4 and still experiencing the issue. I am deploying the queue-depth change today, fingers crossed.

  13. What is the fix for a VNX5200 that has been updated to the latest code for file and block?

    I am experiencing this issue right now. The NFS mount to the VNX5200 keeps going grey and (inactive). The VMs seem to be pingable, but I cannot manage them in the vSphere Client (Unable to connect to the MKS). A ticket has been opened with VMware.

    Other hosts still show as connected to this datastore.

    esxcfg-nas -r is not reconnecting it either. That network is pingable.

    • The fix for my situation was to upgrade the firmware in the QLogic 10GbE NIC cards and then, when that was complete, upgrade the ESXi QLogic driver (the new driver needed the new firmware to be stable). Knock on wood, but a few weeks have gone by and we’re still running 100% stable.

  14. Hi,

    Do you know if the update 7.3.7P2 contains the fix in 7.3.7P1D2?

    I have an N3600 (rebranded FAS2050) and I only seem to be able to get 7.3.7P2, and a previous comment makes me think I really need 7.3.7P1D2.

    Cheers
    Aftab
