There is a problem that impacts our mutual customers running VMware ESXi 5 with NFS connectivity. While details are not finalized, it appears the engineering teams at VMware and NetApp have identified an issue between the NFS client in the ESXi 5 stack and the NFS service in Data ONTAP that results in the two behaving badly under high I/O load when SIOC is not in use. This issue appears to affect only vSphere 5 releases (not vSphere 4 or VI3) and FAS arrays with fewer than 2 CPUs; thus it is seen across the FAS2000 series and lower-end systems in the FAS3000 series. I cannot state whether this issue may impact other NFS platforms like EMC Isilon, Celerra & VNX.
Massive investments in engineering resources go into assuring the quality of product releases and joint solutions; inevitably something falls through the cracks, and this is one of those times. For those impacted by this issue, my apologies. The NetApp and VMware engineering teams have been working furiously to identify and resolve this issue. NetApp engineering has released a fix, and for those unable to upgrade their storage controllers, VMware engineering has published a pair of workarounds.
Clarifying The Issue:
An NFS datastore disconnect issue displays the following behaviors…
NetApp customers can upgrade Data ONTAP to correct this issue. Versions 7.3.7P1D2 & 8.0.5 have been released, and the forthcoming 8.1.3 is expected soon. While Data ONTAP upgrades are non-disruptive, they should likely be scheduled for times of reduced I/O activity.
Note: Data ONTAP release families are defined as 7.3.x, 8.0.x, and 8.1.x, with each dot release introducing a new set of features and capabilities. To address a bug, NetApp support suggests applying the Data ONTAP (DOT) release that contains the fix within the release family already installed on your array.
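To make the family-matching rule concrete, here is a minimal sketch in Python. The function names are hypothetical, and the table holds only the fix releases named in this post. It maps an installed version to the fix release in the same family. It does not check whether the installed version already contains the fix.

```python
import re

# Fix releases named in this post, keyed by Data ONTAP release family.
FIX_BY_FAMILY = {
    "7.3": "7.3.7P1D2",
    "8.0": "8.0.5",
    "8.1": "8.1.3",  # described above as forthcoming
}


def release_family(version):
    """Return the 'major.minor' family of a version string such as '8.0.2P3'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError("unrecognized Data ONTAP version: %r" % version)
    return "%s.%s" % (match.group(1), match.group(2))


def fix_release_for(version):
    """Return the fix release for the installed family, or None if the
    family is not one of those listed in this post."""
    return FIX_BY_FAMILY.get(release_family(version))


if __name__ == "__main__":
    for installed in ("7.3.5.1", "8.0.2P3", "8.1.1"):
        print(installed, "->", fix_release_for(installed))
```

An array on 8.0.2P3, for example, would stay within the 8.0.x family and move to 8.0.5 rather than jumping to 8.1.x.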
While a Data ONTAP upgrade is non-disruptive, some VMware administrators may prefer to address the issue immediately rather than wait for an upgrade window. For those interested, VMware has published the following:
Workaround Option #1 – Enable SIOC
For those with a vSphere Enterprise Plus license, enabling Storage I/O Control will eliminate this issue, as it manages the value of MaxQueueDepth.
Workaround Option #2 – Limit MaxQueueDepth
For those without a vSphere Enterprise Plus license or those who have not enabled SIOC, setting a manual limit on MaxQueueDepth will prevent the disconnect issue from occurring.
For the step-by-step procedure on how to complete this process in the vSphere Client, the vSphere 5 Web Client, and on the command line, please see VMware KB 2016122.
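As a rough illustration of the command-line route, the Python sketch below assembles an esxcli command that sets the NFS queue depth advanced option. The helper name is hypothetical. The value 64 is the example limit discussed in this post. The esxcli syntax shown is my understanding of the advanced-settings interface, so treat KB 2016122 as the authority for the exact procedure and any follow-up steps.

```python
from typing import List

# Advanced option path for the NFS queue depth (assumed from the
# setting name MaxQueueDepth; confirm against VMware KB 2016122).
NFS_MAX_QUEUE_DEPTH_OPTION = "/NFS/MaxQueueDepth"


def build_set_queue_depth_cmd(depth: int = 64) -> List[str]:
    """Build the argument list for setting the NFS MaxQueueDepth limit.

    This only constructs the command; it does not run anything on a host.
    """
    if depth < 1:
        raise ValueError("queue depth must be a positive integer")
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "-o", NFS_MAX_QUEUE_DEPTH_OPTION,
        "-i", str(depth),
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run on an ESXi host.
    print(" ".join(build_set_queue_depth_cmd(64)))
```

Generating the command as a list rather than running it directly keeps the sketch safe to try anywhere and makes it easy to review the exact string before applying it to each host.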
Considerations of the Workarounds:
I would advise that these workarounds be implemented on a temporary basis and remain in place only until the NetApp FAS array(s) have been upgraded, at which point the workarounds should be disabled.
The reason for this suggestion is that when one implements an I/O limit, such as a queue depth of 64 in place of the default of roughly 4.29 billion (4294967295), there is the potential to create an artificial I/O bottleneck. vSphere is equipped to remedy such issues via data migration technologies like SDRS; however, please note that shuffling data produces a negative impact on storage and networking resources for those using disk-based backups, data deduplication, and data replication with VMware.
I will edit this post should additional information be made available.