There is a problem that impacts our mutual customers running VMware ESXi 5 with NFS connectivity. While the details are not finalized, it appears the engineering teams at VMware and NetApp have identified an issue between the NFS client in the ESXi 5 stack and the NFS service in Data ONTAP that results in the two behaving badly under high I/O load when SIOC is not in use. This issue appears to affect only vSphere 5 releases (not vSphere 4 or VI3) and FAS arrays with fewer than four CPUs; it is thus seen across the FAS2000 series and lower-end systems in the FAS3000 series. I cannot state whether this issue may impact other NFS platforms like EMC Isilon, Celerra, and VNX.
Massive investments in engineering resources go into assuring the quality of product releases and joint solutions; inevitably something falls through the cracks, and this is one of those times. For those impacted by this issue, my apologies. The NetApp and VMware engineering teams have been working furiously to identify and resolve this issue. A fix has been released by NetApp engineering, and for those unable to upgrade their storage controllers, VMware engineering has published a pair of workarounds.
Clarifying The Issue:
An NFS datastore disconnect issue displays the following behaviors…
- NFS datastores are displayed as greyed out and unavailable in vCenter Server or the vSphere Client
- Virtual Machines (VMs) on these datastores may hang during these times
- NFS datastores often reappear after a few minutes, which allows VMs to return to normal operation
- This issue is most often seen after ESXi 5 is introduced into an environment
This issue is documented in VMware KB 2016122 and NetApp Bug 321428
The Fix:
NetApp customers can upgrade Data ONTAP to correct this issue. Versions 7.3.7P1D2 and 8.0.5 have been released, and the forthcoming 8.1.3 is expected soon. While Data ONTAP upgrades are non-disruptive, they should likely be scheduled for times of reduced I/O activity.
Note: Data ONTAP release families are defined as 7.3.x, 8.0.x, and 8.1.x, with each dot release introducing a new set of features and capabilities. To address a bug, NetApp support suggests applying the release within the family already installed on your array that contains the fix.
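If you are unsure which release family a given controller is running, a quick check before planning the upgrade helps; this is just a sketch assuming SSH access to a 7-Mode controller (the hostname and account below are placeholders):
# Query the installed Data ONTAP release on a controller ("netapp01" and "admin" are placeholders)
ssh admin@netapp01 version
# The output resembles: NetApp Release 7.3.7P1: ...
# Match the reported family (7.3.x, 8.0.x, or 8.1.x) to the fixed release listed above for that family.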
The Workarounds:
While a Data ONTAP upgrade is non-disruptive, some VMware administrators may prefer to address the issue immediately to ensure operations. For those interested, VMware has published the following workarounds:
Workaround Option #1 – Enable SIOC
For those with a vSphere Enterprise Plus license, enabling Storage I/O Control (SIOC) will eliminate this issue, as it manages the value of MaxQueueDepth.
Workaround Option #2 – Limit MaxQueueDepth
For those without a vSphere Enterprise Plus license, or those who have not enabled SIOC, setting a manual limit on MaxQueueDepth will prevent the disconnect issue from occurring.
For the step-by-step procedure on how to complete this process in the vSphere Client, the vSphere 5 Web Client, and on the command line, please see VMware KB 2016122.
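For reference, here is a rough sketch of the command-line path (the authoritative step-by-step procedure remains KB 2016122; see the KB for whether a host reboot is required for the new value to take effect):
# Set the NFS.MaxQueueDepth advanced option to 64 on an ESXi 5.x host
esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -i 64
# Verify the configured value
esxcli system settings advanced list -o "/NFS/MaxQueueDepth"
# The older esxcfg-advcfg syntax accomplishes the same:
# esxcfg-advcfg -s 64 /NFS/MaxQueueDepth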
Considerations of the Workarounds:
I would advise that these workarounds be implemented on a temporary basis and remain in place until the NetApp FAS array(s) have been upgraded, at which point the workarounds should be disabled.
The reason for this suggestion is that when one implements an I/O limit, such as a queue depth of 64 in place of the default of roughly 4.29 billion (4294967295), there is the potential of creating an artificial I/O bottleneck. vSphere is equipped to remedy such issues via data migration technologies like Storage DRS; however, please note that shuffling data has a negative impact on storage and networking resources for those using disk-based backups, data deduplication, and data replication with VMware.
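Once the array has been upgraded, removing the workaround is just as simple; a minimal sketch, assuming the esxcli method above was used to set the limit:
# Reset NFS.MaxQueueDepth to its default value (4294967295)
esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -d
# Or set the default value explicitly:
# esxcli system settings advanced set -o "/NFS/MaxQueueDepth" -i 4294967295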
I will edit this post should additional information be made available.
I’d like to thank Cormac Hogan for helping raise awareness with his post.
Be advised that none of the current workarounds are sufficient, and they lower the performance of the storage. We have SIOC turned on and still get > 60 sec. hangs during high I/O.
Lars, Have you upgraded Data ONTAP? If so, has that addressed your issues?
Vaughn, we are running 8.1.2 7-Mode. Our local NetApp support apparently has no clue about the issue; they are advising us to optimize file layout, etc. Will have a chat with them ASAP 😉
Point them to the bugs in the post, should help 🙂
Just to be clear, it is a NetApp issue. iSCSI clients hang/disconnect during the hangs as well, and NetApp logs the event locally in /etc/messages.
Different issue…
@Vaughn strange, both within 3 seconds, ref NetApp /etc/messages log entries
Mon Jan 28 13:46:46 CET [XXvmware_vfiler@XX-netapp02b: NwkThd_00:warning]: NFS response to client 10.28.0.48 for volume 0x2997b33f(XXvmware) was slow, op was v3 setattr, 140 > 60 (in seconds)
Mon Jan 28 13:46:49 CET [XX-netapp02b:iscsi.notice:notice]: ISCSI: Initiator (iqn.1998-01.com.vmware:XX-vmware01-5014b0da) sent Abort Task Set request, aborting all SCSI commands from this initiator on lun 0
Maybe a new issue altogether then, occurring on several vFilers here. Moving the VMs to 3PAR at the moment for some relief…
Vaughn, you mention 7.3.7P2 in your post. The latest version I can see on the NetApp site is 7.3.7P1.
Do you have any information about a 7.3.7P2 being released?
/E
I haven’t been able to verify, is the code available now?
It’s 7.3.7P1D2. To get to it, on the SW download page, scroll down to “To access a specific [drop down] version, enter it here:”. In the drop-down, select Data ONTAP and enter the version. The rest is straightforward.
Erik
I have edited the post – the code you seek is 7.3.7P1D2
Thanks for the heads up!
Looks like EMC got to make code changes to their benefit.
Pure speculation on your part.
I think it’s more likely that the default EMC best practice is to use SIOC, which avoids the issue while also capping I/O.
I was just told by NetApp support that this bug only applies to the FAS2020. Is that true?
I would expect this issue to be related to an incompatibility condition created by the ESXi 5 NFS client and Data ONTAP and not to a hardware platform.
With that said the NetApp support team are the experts.
I’m not so sure about his explanation and have asked for clarification. The VMware KB specifically says “This issue can occur on any NetApp filer, including the FAS 2000/3000 series.” Also, my host-side logs look pretty similar to what’s in the KB. The only difference is that when my datastores check out, they aren’t reconnecting. Which is problematic.
VMware are not the experts on NetApp hardware or how our code works. They based that statement on seeing the problem mostly on FAS2000s and a small number of FAS3140s and assumed that if it hits two families, it hits all of them. An internal NFS code review shows it is not possible to hit this on higher models with 4 or more CPUs. Two cases reporting this issue on higher models were triaged to other issues. (In other words, “This isn’t the bug you’re looking for.” 😉) Our review of the draft VMware KB was before this code review, or we would have told that to the author.
THX Peter
Thanks for the response (although it does put me back to square one on figuring out what happened). Is there a definitive list of which models ARE affected?
Bug notes in link above list the following models:
***************snip*****************
Some recent releases of various vendors’ NFS clients appear to pursue an aggressive retransmission strategy which may increase the chances that this problem could occur.
The following storage-controller models are known to be susceptible:
FAS250
FAS270
FAS940
FAS960
FAS2020
FAS2040
FAS2050
FAS3020
FAS3040
FAS3050
FAS3140
FAS3210
FAS6030
FAS6040
N3300
N3400
N3600
N3700
N5200
N5300
N5500
N6040
N6210
N7600
N7700
A so-called “partial fix” in some releases is not effective to prevent several of the modes of failure.
A related bug is 654196.
FYI: just to clarify why enabling SIOC may not be sufficient and does not always prevent the issue/hang:
1) If the filer is running other workloads, sharing NFS to non-SIOC clients, and/or
2) if/when the filer runs internal jobs like dedupe/snapshot cleanup,
then the following warning may appear in the vCenter alarm view:
“Unmanaged I/O workload detected on shared datastore running Storage I/O Control (SIOC)”
VMware then briefly disables SIOC (see below for why), and the filer may hang.
Throttling in this case would result in the external workload getting more and more bandwidth, while the vSphere workloads get less and less. Therefore, SIOC detects the presence of such external workloads, and as long as they are present while the threshold is being exceeded, SIOC competes with the interfering workload by reducing its usual throttling activity.
Hi,
We have a customer running 5.1 and ONTAP 8.1.1p1.
They have these connectivity issues most likely due to NFS queue depth.
Reading the KBs, we advised them to enable SIOC on the datastores.
This did not solve the issue; my question now is…
Customer wants to have SIOC enabled, BUT it seems that he also needs to set the NFS.MaxQueueDepth to 64 to avoid this issue.
What are the consequences of having SIOC enabled and NFS.MaxQueueDepth set to 64?
Will SIOC still compete for resources?
Is this not a recommended combination of settings?
BR
Linus
Linus,
My understanding is NetApp Customer Success Services (CSS) is planning to address this issue in the 8.1 code line beginning with 8.1.2p4.
The workarounds carry a potential downside of increased SDRS migrations (due to IO greater than the queue limit). As stated in the post, I’d implement the workarounds until the patch is available.
Vaughn, when will 8.1.2P4 be available??? The workarounds are not an option for our customer. New MetroCluster installation with vSphere 5.1. What are your suggestions?
Vaughn, when will 8.1.2P4, or better yet 8.1.3, be available??? Are we talking about weeks or months?
I’m experiencing similar disconnects in my environment, but our specs don’t fall under the acknowledged affected devices.
ESXi 5.0.0
10Gb
FAS3270
4 x Xeon E5240
Version 8.1 7-Mode
Can anyone from NetApp comment?
Finally a solution! This has been a thorn in my side for over a year!
My take on the FAS3140 and Data ONTAP 8.1.1:
We have vSphere 4.1 and Data ONTAP 8.1.1 running on Cisco UCS and a FAS3140.
When we delete a VM snapshot, the whole NFS datastore gets dismounted and remounted. How cool is that?
After 3 months of troubleshooting day and night with NetApp support, they pointed me to this bug: inefficient pre-fetching of metadata blocks delays the WAFL Consistency Point.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=393877
We want to overcome the performance issue, so we looked at Flash Cache cards and Flash Pool:
Flash Pool is not supported.
Flash Cache cards can only work at half capacity.
It all points back to the FAS3140 not being able to cope with what Data ONTAP 8.1.1 does on the system.
On 7.3.5 we didn’t have these limitations.
The FAS3140 is a lower-end model with only 2 CPUs and 4 GB of memory.
NetApp wants us to upgrade to a higher-end model.
The issue: we still have two years left on our leasing agreement, and due to government funding cuts to emergency services, our options are very limited.
What explanation do you have for us, NetApp?!
Data ONTAP 8.x is not meant for the FAS3140.
Jude,
Support issues can be frustrating, especially when resolution is difficult to achieve. The behavior you are describing sounds odd. The native VMware snapshot function is based on SCSI transaction logs and can exhibit a number of nuances / unexpected behaviors; what you have described is not one of them.
I’d like to know more. Would you be willing to send me your case number?
Jude,
I sent this to you privately but am reposting here for the community…
I’d clean up the alignment or get rid of the linked clones if that’s what they’re using.
Also verify the settings for BURT 90314 are in effect, since a VMware snap delete is a filer WAFL delete.
It’s a 3140 in 7-mode
The actual number of NFS ops is pretty low based upon – https://latx/viewperf/id/149681
Elpsd CP CP CP CP CP CP CP CP CP CP CP CP CP CP CP
Time Total Blkd Nrml Timer Snap Bufs Drty Full B2B Flsh Sync vbufs dB2B dVecs mSecs
( 293)01/16/013 22:01:02 35 0 35 5 0 0 0 30 0 0 0 0 0 0 102226
( 293)01/16/013 22:08:04 53 1 52 0 0 0 0 48 1 0 4 0 0 0 132543
( 294)01/16/013 22:14:55 106 2 104 0 0 0 0 104 2 0 0 0 0 0 157300
( 293)01/16/013 22:22:20 40 1 39 3 0 0 4 29 1 0 3 0 0 0 120899
( 293)01/16/013 22:29:29 31 0 31 13 0 0 0 18 0 0 0 0 0 0 86016
NFS TCP input flow control is engaged, so the read latency is getting hit:
tcp input flowcontrol receive=30808, xmit=0
tcp input flowcontrol out, receive=30808, xmit=0
They do have alignment issues: look at all the delta files…
And that’s a small subset…
The LUN stats look fine though, so it’s just NFS…
Files Causing Misaligned I/Os
[Counter=38505], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/NCS01/NCS01-flat.vmdk
[Counter=89345], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01_1-000002-delta.vmdk
[Counter=12], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01-000002-delta.vmdk
[Counter=43], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASISD01/CFASISD01-000003-delta.vmdk
[Counter=631], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000003-delta.vmdk
[Counter=4], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_2-000001-delta.vmdk
[Counter=6], Filename=DC1_PROD_VMDK_SAS_03/VMDK_03/CFAEVLT01/CFAEVLT01-flat.vmdk
[Counter=94], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000001-delta.vmdk
[Counter=12], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAIDNS01/CFAIDNS01-flat.vmdk
[Counter=1474], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAMSPA03/CFAMSPA03-000001-delta.vmdk
[Counter=1055], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_1-000001-delta.vmdk
[Counter=75], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/CFANOPM01/CFANOPM01-000004-delta.vmdk
[Counter=2], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAIDNS01/CFAIDNS01-flat.vmdk
[Counter=19], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/CFANOPM01/CFANOPM01-000004-delta.vmdk
[Counter=12121], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAISPI01/CFAISPI01_1-000002-delta.vmdk
[Counter=255], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAMSPA03/CFAMSPA03-000001-delta.vmdk
[Counter=35], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000003-delta.vmdk
[Counter=4688], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/NCS01/NCS01-flat.vmdk
[Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAHPRO02/CFAHPRO02-000003-delta.vmdk
[Counter=131], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFASQLS05/CFASQLS05_1-000001-delta.vmdk
[Counter=5], Filename=DC1_PROD_VMDK_SAS_02/VMDK_02/ilm1/ilm1-000001-delta.vmdk
[Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFAUCMD01/CFAUCMD01-000002-delta.vmdk
[Counter=1], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFACCAM01/CFACCAM01_2-000001-delta.vmdk
[Counter=26], Filename=DC1_PROD_VMDK_SAS_01/VMDK_01/CFABSMD01/CFABSMD01_1-000001-delta.vmdk
Thanks to the brilliance of John Ferry, NetApp Critical Problem Resolution Escalations Engineer, for the support shared above
Had this issue with FAS2020. After going to 7.3.7P1D2 all problems gone. Thanks!!
I am having this issue with a FAS2020A right now. It seemed to pop up for the first time about 2–3 weeks following our upgrade to ESXi 5.1. The workaround of changing NFS.MaxQueueDepth to 64 had no positive result for us. Currently in the process of upgrading the filer to 7.3.7P1D2. Will advise with results.
Heath – thanks, looking forward to your results
It’s been running 8 hours with no NFS drops. Looks good so far. Will update again tomorrow.
Been running 7.3.7P1D2 on the FAS2020A for 48 hours now. Completely cleared up the disconnects. No negative side effects. Thanks all!
Awesome!
Just ran into this issue today while building out a VMware cluster for Oracle. Vaughn – do you know if this issue has been resolved in any of the 8.1.2P* releases?
Thanks,
Dan
Are you sure a similar problem does not exist on the FAS6200 series? We were seeing NFS disconnects/reconnects prior to enabling Storage I/O Control.
We encountered NetApp bug 393877 and confirmed that 8.1.2P3 resolved the issue. http://techfailures.wordpress.com/2013/03/28/netapp-fail-2/
ONTAP release 8.1.2P4 was released today, and bug 321428 has been marked as fixed in it.
Looks like we have run into the same issue about 4 weeks after doing an upgrade to ESXi 5.1. We are running FAS2040s with ONTAP 7.2.2P7. We are contacting NetApp now so they can confirm; then we will attempt to upgrade our filers.
Running 5.0U2 and 8.1.2P4 and still experiencing the issue. I am deploying the queue depth change today, fingers crossed.
We just upgraded to 8.1.3P1 and still experience this issue.
Hi Norbert.
We are even running 8.1.3P2 and the issue is still here. Any progress in the meantime?
Thanks for sharing.
Bojan
What is the fix for a VNX5200 that has been updated to the latest code for file and block?
I am experiencing this issue right now. The NFS mount to a VNX5200 keeps going grey and (inactive). VMs seem to be pingable, but I cannot manage them in the vSphere Client (Unable to connect to the MKS). A ticket has been opened with VMware.
Other hosts still show as connected to this datastore.
esxcfg-nas -r is not reconnecting it either. That network is pingable.
The fix for my situation was to upgrade the firmware in the QLogic 10GbE NIC cards and then, when that was complete, upgrade the ESXi QLogic driver (the new driver needed the new firmware to be stable). Knock on wood, but a few weeks have gone by and we’re still running 100% stable.
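For anyone chasing a similar NIC-level cause, a quick sketch of how to confirm which driver and firmware an ESXi 5.x host is actually running (vmnic0 is just an example uplink):
# Show driver name, driver version, and firmware version for one uplink
esxcli network nic get -n vmnic0
# List all uplinks first to locate the QLogic 10GbE ports:
# esxcli network nic list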
Hi,
Do you know if the 7.3.7P2 update contains the fix from 7.3.7P1D2?
I have an N3600 (rebranded FAS2050) and I only seem to be able to get 7.3.7P2, and a previous comment makes me think I really need 7.3.7P1D2.
Cheers
Aftab
Ignore me – I just reread the KB and will get P2 downloaded and installed.
Cheers