Real-world HA Expectations (VMFS)

Question

We have recently implemented all of the suggestions in the available VNXe High Availability document. I am getting ready to plan our upgrade to MR2 on our VNXe3100 and was curious about what we can actually expect for our iSCSI VMFS volumes. I have seen some discussions suggesting that the actual failover isn't "instant" and can take perhaps 30 seconds to 2 minutes.

From a virtual machine standpoint, what should we expect when running software upgrades on the VNXe? Can I keep the virtual machines running and not expect any issues with our vSphere cluster accessing the datastores while the storage processors are rebooted? Or should I do as we did previously: power down all the VMs, run the VNXe software upgrade, reboot the storage processors, and then power up the VMs?

Sriky · Accepted Answer

@Nadia:

I am not sure what changes were made in Unisphere, but it is not recommended to make any changes to the iSCSI server while a session is established between the target and the initiator.

Note that during an upgrade (a graceful process) each SP reboots twice and uses cluster resources to fail over, and it might take 30-120 seconds for the services to fail over to the peer SP. However, if you want to test a real-time failover, you can try disconnecting the Ethernet adapter or pulling the SP on which the iSCSI server is running, and the failover will take place instantaneously.

In a vSphere environment, please make sure that VMware Tools is installed on all the virtual machines, which by default increases the disk timeout value to 180 seconds (ESX 4.0 and later).

@captainflannel: For high-demand machines it is better to take a downtime, as increasing the timeout value will increase the recovery time; raising the timeout is best suited to machines that run a medium or low load.

Regards,

Sri

Sriky · Answer

Hi,

VNXe is designed for maximum uptime with its HA features.

It is possible for the ESX server to time out when an SP reboots and fails over to the other SP (as the upgrade is a graceful process).
You can increase the SCSI timeout value on the ESX server by editing the vmkiscsid.db file using sqlite3.
The disk timeout value also has to be increased in each VM's registry:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk\TimeOutValue
By default the value is 60 seconds.
If you prefer taking a downtime and upgrading the box, then:
1. Power off the virtual machines running on the VNXe.
2. Begin the upgrade on the VNXe. After the upgrade is successful there is no need to reboot the SPs again; you can just power on the VMs.
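
For anyone who would rather script the guest-side registry change than edit each VM by hand, here is a minimal sketch in Python that writes out a .reg file for the TimeOutValue key named above. The function name build_disk_timeout_reg, the output file name, and the 180-second value are illustrative assumptions, not an EMC or VMware recommendation. Test the resulting file on a non-production VM first.

```python
# Sketch only: generate a .reg file that sets the Windows guest disk timeout.
# The key path is the one quoted in this thread; everything else is an assumption.

REG_KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk"


def build_disk_timeout_reg(timeout_seconds: int) -> str:
    """Return .reg file text that sets TimeOutValue (a REG_DWORD, in seconds)."""
    if not 1 <= timeout_seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{REG_KEY}]",
        # .reg files store DWORD values as 8 hexadecimal digits
        f'"TimeOutValue"=dword:{timeout_seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 180 is a hypothetical value picked for illustration
    with open("disk_timeout.reg", "w", newline="") as handle:
        handle.write(build_disk_timeout_reg(180))
```

The generated file could then be imported on a guest or distributed with whatever deployment tooling is already in place.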

Regards,

Sri

studionc · Answer

This is a really hot question: as of today we don't feel confident executing an upgrade with live VMs on the datastores.

(Actually we are scared of applying any change in Unisphere... in the past, making a minor configuration change left us with an unresponsive iSCSI server.)

I still don't understand whether HA is effectively working or whether we need to tweak parameters.

It's not feasible to modify the registry of each VM in a real production environment.

HA should work without problems as per design, but sadly that is not the case in our experience.

EMC should kindly take an official position on real HA behaviour.

Regards,

Nadia

captainflannel · Answer

Regarding the suggestion of changing the TimeOutValue registry setting, what value would be suggested? 180 seconds? So this would allow 3 minutes for the storage processors to fail over?

I would also be curious whether anyone has seen any performance differences in high-demand VMs with such a registry change. It's easy enough to push this change out via script/GPO, but that still does not fully answer my question.

Thanks.

engineering-rel · Answer

@captainflannel: We were advised by EMC support, in response to a ticket on our VNXe3100, to increase the HA timeout on our XenServers to 120 seconds from the XenServer default of 30 seconds. In testing we noted SP failovers took as little as 37 seconds and as much as 1.5 minutes.
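
To make the arithmetic behind those numbers explicit, here is a small Python sketch that compares a timeout against the slowest observed failover. The helper name timeout_margin is hypothetical; the figures are just the ones quoted in this post (failovers of 37 to 90 seconds, timeouts of 30 and 120 seconds), so treat it as an illustration rather than a sizing rule.

```python
# Sketch only: how much headroom a timeout leaves over the slowest failover seen.

def timeout_margin(timeout_s, observed_failovers_s):
    """Return the timeout minus the slowest failover; negative means too short."""
    return timeout_s - max(observed_failovers_s)


if __name__ == "__main__":
    observed = [37, 90]  # seconds; 1.5 minutes = 90 seconds
    for label, timeout in [("XenServer default", 30), ("EMC-advised", 120)]:
        margin = timeout_margin(timeout, observed)
        verdict = "covers" if margin >= 0 else "does not cover"
        print(f"{label} ({timeout}s) {verdict} the slowest failover; margin {margin}s")
```

With these figures the 30-second default falls 60 seconds short of the slowest failover, while 120 seconds leaves 30 seconds to spare.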

Hope that helps.

captainflannel · Answer

Just updating my original post with our experiences. Prior to the latest OS release we completely implemented all of the recommendations in the provided EMC HA guide for VNXe for iSCSI. I then performed the OS upgrade after hours. I wanted to see what would happen, so I did not shut down any virtual machines, including SQL (we had backups!). During the SP reboots all of our VMFS datastores stayed online, and our virtual machines stayed online, with perhaps a few timeouts or pauses within the operating system, but no vSphere failovers or alerts. We did have a single Linux VM (VMware Tools installed and up to date) which did have a vSphere failover and failed during the upgrade. But out of our 50+ VM infrastructure I'm OK with that. Once the OS upgrade was complete, the Linux machine mentioned powered on without issue.

studionc · Answer

Hi, thanks for the update. Could you please tell me exactly which recommendations you implemented? (Maybe I'm missing some point.)
