One of my customers uses VNX2 File for their VMware servers. During a Data Mover failover, they encountered problems with VMs caused by the NFS timeout.
Our EMC best practices documents are based on the VMware NFS values (see below). But if you work through these values, you find that the total NFS timeout is roughly 125 seconds. After that, the NFS datastore is marked down and the VMs freeze or stop.
We know that a VNX Data Mover failover can take 2 to 3 minutes (120-180 seconds). In that situation, the NFS datastore will be marked down before the failover completes, and the VMs will go down even if the guest disk timeout is configured to 180 seconds.
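For reference, the 180-second guest disk timeout mentioned above is typically applied inside a Linux guest roughly like this (the device name sdb and the udev rule path are placeholders, not taken from the customer's environment):

```shell
# Raise the SCSI command timeout to 180s for one disk (sdb is a placeholder).
echo 180 > /sys/block/sdb/device/timeout

# To make it persistent across reboots, a udev rule along these lines
# (file name is an assumption; check the distro's conventions):
# /etc/udev/rules.d/99-disk-timeout.rules
#   ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{type}=="0", ATTR{timeout}="180"
```

Note that this only keeps the guest from erroring out its disks; it does not help if the ESXi host has already marked the NFS datastore down.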
What is our solution to ensure the datastore is still up after the failover completes?
Thanks for your help.
VMware NFS values (ESXi defaults)
NFS.HeartbeatTimeout = 5
NFS.HeartbeatMaxFailures = 10
NFS.HeartbeatDelta = 5
NFS.HeartbeatFrequency = 12
Minimum time to mark a volume down (minTime) = (HeartbeatFrequency * (HeartbeatMaxFailures - 1)) + HeartbeatTimeout = 113 seconds
Maximum time to mark a volume down (maxTime) = HeartbeatDelta + HeartbeatFrequency + minTime = 130 seconds
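To make the arithmetic explicit, here is a small shell sketch of the two formulas above. The values are the ESXi defaults as documented by VMware, assumed here rather than read from the customer's hosts:

```shell
# Assumed ESXi NFS heartbeat defaults (verify on the actual hosts).
FREQUENCY=12      # NFS.HeartbeatFrequency: seconds between heartbeats
TIMEOUT=5         # NFS.HeartbeatTimeout: seconds to wait for a reply
MAX_FAILURES=10   # NFS.HeartbeatMaxFailures: failures before marking down
DELTA=5           # NFS.HeartbeatDelta: allowed heartbeat staleness

# Minimum time to mark a volume down.
MIN_TIME=$(( FREQUENCY * (MAX_FAILURES - 1) + TIMEOUT ))
# Maximum time to mark a volume down.
MAX_TIME=$(( DELTA + FREQUENCY + MIN_TIME ))

echo "minTime: ${MIN_TIME}s"   # minTime: 113s
echo "maxTime: ${MAX_TIME}s"   # maxTime: 130s
```

With the defaults this gives a 113-130 second window, the roughly 125 seconds cited above, which is well short of a 2-3 minute Data Mover failover.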
For a VNX2, datamover failover time should be somewhere in the sub-60-second range. Has the customer engaged with EMC Support to determine if there are other issues with their VNX2 array that could be contributing to the long failover?
Do you get any error messages on the VNX? If so, note the error codes. Also check whether the Control Station reboots at the time of failover, or whether it hangs. Once you have the error code from the storage end, look it up with the command below (from the storage management station):
Type nas_message -info <MessageID>
Where <MessageID> is the message identification number.
This should give you the exact issue, if it is occurring on the storage end. Check the failover policy as well.
I hate to ask this because it's not immediately helpful-- but is there any way you can move them down the path to leveraging block storage off the VNX instead?
If not, as suggested above, work with EMC Support to determine why the failover time is taking longer than 60-90 seconds.
First, thank you for your answers. An SR is open for this problem, but I don't have a good answer yet.
@James, in real life, Data Mover failover time is never less than 90-120 seconds (and that is when there are not many file systems/checkpoints).
It may even be faster to fail over (switch over) to a second VNX with Replicator. NetApp recommends a 180-second timeout.
How can we explain to our customers that they can't use VMware on NFS... for this reason alone?
This should not be the answer. EMC should not be known for NAS being held back by its failover timeline; they really have to shift more focus to NAS (with unified storage).
You can offer the explanation below:
1) The standard heartbeat time for communication between two Data Movers (to get a file system failed over to the other DM) is at most 120-180 seconds, which is in line with the industry.
2) This may take longer if the file system (in this case the datastore) is loaded with snaps; their delta sets may take longer to transfer to the other DM, depending on the size of the datastore (you also need to provide figures).
You can also upgrade DART on the VNX, in case that makes a difference. A few file systems can be moved to another datastore (it must be mounted on a separate DM) for load balancing at the time of failover (if that is the case, as you described above and as answered by EMC Support).
When a customer is looking to buy or already owns a VNX, the conversation around which protocol to use for a VMware environment is a fairly common one to have. When I speak with a customer about this, failover time is ALWAYS a conversation, and nearly always leans them towards going block (iSCSI in the case of Ethernet).
Then they use the Data Mover functionality for CIFS shares-- because a 60-120 second outage for file services tends to be much more tolerable than dumping an NFS datastore running X number of VMs.
All that is not to say you CAN'T use NFS off the VNX DMs for VMware, but if you do, you need to be very thorough in testing and validating failover during the implementation, for the exact reasons outlined above.
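As part of that failover testing, one lever is the ESXi NFS heartbeat tuning itself. A sketch of inspecting the current settings and stretching the mark-down window with esxcli follows; the value 15 is illustrative, not a validated recommendation, and widening the window also delays legitimate APD detection:

```shell
# Inspect the current NFS heartbeat settings on an ESXi host.
esxcli system settings advanced list -o /NFS/HeartbeatFrequency
esxcli system settings advanced list -o /NFS/HeartbeatTimeout
esxcli system settings advanced list -o /NFS/HeartbeatMaxFailures

# Illustrative only: raising MaxFailures to 15 stretches the minimum
# mark-down time to 12 * (15 - 1) + 5 = 173 seconds.
esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 15
```

Any change like this should be validated against a real Data Mover failover test before relying on it in production.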
Lastly, there are rumors that NFSv4 will be available in vSphere.next. I can't confirm whether that's true, but NFSv4 has advantages around network/interface failover compared to the current implementation, which persists all the way up to 5.5.