VNX: NFS datastore intermittently goes offline for a single host

Summary: NFS datastore intermittently goes offline for a single host.

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

One or more NFS datastores enter an APD (All Paths Down) state on a single host at a time. This can happen to different datastores on different hosts, or to the same datastore on multiple hosts. The occurrence is generally random and intermittent, and access is restored either by disabling and re-enabling the Ethernet ports on the ESXi host or by rebooting the host. It does not necessarily happen to the same datastore or the same host each time.

The key feature of this problem is that the affected datastore or NFS export is still accessible from other hosts. If the datastore is down on all hosts, this problem is less likely to be the cause. If access is not restored by disabling and re-enabling the network ports or by rebooting the host, this issue is not the cause.

This affects VNX1, VNX2, and eNAS products.

 

Cause

VMware support may advise setting NFS.MaxQueueDepth to 64. Dell does not currently have a recommendation for this value, and changing it is not likely to resolve this specific issue.

Engineering has identified an issue in the way that the VNX handles its TCP send window calculation in some situations. At a certain point, the VNX inappropriately sets the TCP send window value to 0. This prevents the VNX from sending any new data on that connection to the host it is communicating with. The VNX can still acknowledge incoming data at the TCP layer, but cannot send any NFS responses.

So far, this behavior has only been seen affecting ESXi NFS datastores, because of the specific way that ESXi performs its TCP acknowledgments at times. At certain points, instead of sending an acknowledgment along with its next data packet, ESXi sends a separate acknowledgment as soon as new data is received from the VNX, even if it has data in its transmission queue. This behavior makes the Data Mover (DM) believe that the transfer is unidirectional and puts it into header prediction mode. If the ESXi TCP acknowledgment behavior stays consistent while more than 2 GB of data is transferred from the DM, the DM slowly reduces the TCP send window to 0, leaving that particular TCP connection able to carry data in one direction only (from the host to the array). If the DM receives a data packet with a new ACK number within that 2 GB transfer, or if any packet loss causes a retransmission, the problem is not encountered.
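The precondition described above can be checked against a capture with a rough heuristic. The following is a minimal sketch, not a Dell diagnostic tool. It assumes the host-to-array segments of one TCP connection have already been exported into simple records, and the record layout (`HostSegment`) and function name are hypothetical. It also assumes ACK numbers have been unwrapped so they do not roll over at 2^32, and it does not model retransmissions, which the article says also prevent the problem.

```python
from dataclasses import dataclass
from typing import Iterable

# Threshold taken from the article's "more than 2 GB" statement.
TWO_GB = 2 * 1024**3


@dataclass
class HostSegment:
    """One host-to-array TCP segment (hypothetical record layout)."""
    ack: int          # cumulative ACK number, unwrapped (not modulo 2**32)
    payload_len: int  # TCP payload bytes carried by this segment


def longest_pure_ack_run(segments: Iterable[HostSegment]) -> int:
    """Return the largest number of array-sent bytes that the host
    acknowledged using only separate, payload-free ACKs."""
    run_start = None    # ACK number at which the current run began
    highest_ack = None  # highest ACK number seen so far
    longest = 0
    for seg in segments:
        if highest_ack is None:
            highest_ack = run_start = seg.ack
            continue
        if seg.ack > highest_ack:
            if seg.payload_len > 0:
                # A data packet carrying a new ACK number ends the run.
                run_start = seg.ack
            highest_ack = seg.ack
            longest = max(longest, highest_ack - run_start)
    return longest
```

A result greater than `TWO_GB` only suggests that the connection met the condition described in the article. It does not prove that the send window actually reached 0.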

ESXi runs a heartbeat on the datastore to determine whether it is still available. This heartbeat is a GetAttr request for a specific file on the datastore. If the heartbeat fails a number of times, the ESXi host marks the datastore as APD. Because the VNX cannot reply to the GetAttr requests from the ESXi host while its TCP send window is set to 0, the host marks the datastore as inaccessible. For reasons not identified here, ESXi does not attempt to reset the connection, even though doing so would also resolve this problem. That is why rebooting the host or disabling and re-enabling its network ports restores access.

The TCP send window is calculated separately for each connection, so other datastores remain online provided they have not hit the same condition. The datastore itself is not the problem, so other hosts can still access it unless their own connections to this datastore have hit the same condition.
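The interaction between the zero send window, the heartbeat, and the per-connection scope can be illustrated with a toy model. This is a sketch under stated assumptions, not the ESXi or VNX implementation. The class names are hypothetical, and so are the `max_failures` threshold and the choice to count consecutive failures: the article only says that the heartbeat fails a number of times.

```python
from dataclasses import dataclass


@dataclass
class ArrayConnection:
    """Array side of one TCP connection (toy model)."""
    send_window: int  # bytes the array believes it may send

    def reply_to_getattr(self) -> bool:
        # With a zero send window the array can still ACK at the TCP
        # layer, but it cannot transmit the NFS reply itself.
        return self.send_window > 0


class DatastoreHeartbeat:
    """Host-side heartbeat state; max_failures is a hypothetical threshold."""

    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self.state = "available"

    def probe(self, conn: ArrayConnection) -> str:
        """Send one GetAttr heartbeat over conn and update the state."""
        if conn.reply_to_getattr():
            self.consecutive_failures = 0
            self.state = "available"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                self.state = "APD"
        return self.state


# Host A's connection is stuck at a zero send window; host B's is not.
host_a_conn = ArrayConnection(send_window=0)
host_b_conn = ArrayConnection(send_window=65535)
host_a = DatastoreHeartbeat(max_failures=3)
host_b = DatastoreHeartbeat(max_failures=3)
for _ in range(3):
    host_a.probe(host_a_conn)
    host_b.probe(host_b_conn)

# A reboot or port reset gives host A a fresh connection.
state_before_reset = host_a.state
state_after_reset = host_a.probe(ArrayConnection(send_window=65535))
```

In this model the same datastore is marked APD on host A while host B still sees it as available, and host A recovers as soon as it probes over a new connection. This mirrors the symptom and workaround described in the article.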

This issue can be confirmed if there is a packet trace that covers the datastore going from an online to an offline state.
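One pattern worth looking for in such a trace follows from the cause described above: after some point the array keeps acknowledging the host's requests but never sends payload again on that connection. The sketch below is a heuristic only. It assumes the packets of a single TCP connection have been exported in capture order into simple records, and the record layout (`Packet`), the function names, and the `min_host_data` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    """One packet of a single TCP connection (hypothetical record layout)."""
    time: float       # capture timestamp in seconds
    from_array: bool  # True if sent by the array, False if sent by the host
    payload_len: int  # TCP payload bytes; 0 means a pure ACK


@dataclass
class StallReport:
    last_array_data_time: Optional[float]  # None if the array never sent data
    host_data_after: int                   # host data packets after that point
    array_pure_acks_after: int             # array pure ACKs after that point


def summarize_stall(packets: List[Packet]) -> StallReport:
    """Summarize what happens after the array's last data-bearing packet."""
    last_index = -1
    for i, p in enumerate(packets):
        if p.from_array and p.payload_len > 0:
            last_index = i
    tail = packets[last_index + 1:]
    host_data = sum(1 for p in tail if not p.from_array and p.payload_len > 0)
    array_acks = sum(1 for p in tail if p.from_array and p.payload_len == 0)
    last_time = packets[last_index].time if last_index >= 0 else None
    return StallReport(last_time, host_data, array_acks)


def looks_like_zero_send_window(report: StallReport, min_host_data: int = 3) -> bool:
    """True if the host kept sending requests that were ACKed but never answered."""
    return (report.host_data_after >= min_host_data
            and report.array_pure_acks_after > 0)
```

A positive result narrows the search to the time around `last_array_data_time`. It does not replace a review of the trace by support.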

 

Resolution

The TCP send window calculation behavior will be fixed in a future code release for both the 7.1 and 8.1 code versions (VNX1, VNX2, and eNAS). A hotfix is currently available. If a fix is needed immediately, contact support to request it and to schedule the reboot/failover outage needed to apply it.

 

Affected Products

VNX1 Series

Products

eNAS, VNX1 Series, VNX2 Series, VNX5100, VNX5150, VNX5200, VNX5300, VNX5400, VNX5500, VNX5600, VNX5700, VNX5800, VNX7600, VNX8000
Article Properties
Article Number: 000055059
Article Type: Solution
Last Modified: 19 May 2025
Version:  5