I’m noticing some unexpected network behavior on some of my Hyper-V guests after performing live migrations.
The Hyper-V environment:
5-node Windows Server 2008 R2 SP1 Hyper-V failover cluster, hosting 14 guests
Each host server has the following network configuration, with each network on a separate subnet:
1 NIC for dedicated host management
1 NIC for Cluster Shared Volume network
1 NIC for Cluster heartbeat
1 NIC for Live Migration network
2 NICs for host iSCSI, using the Dell EqualLogic HIT Kit for MPIO, etc.
1 NIC for the Hyper-V virtual switch
2 NICs for iSCSI within the Hyper-V guests
I can reproduce the following behavior fairly consistently, although it does not happen every time:
Things I’ve investigated:
It seems the iScsiPrt errors and Failover Clustering errors are merely a consequence of the longer-than-expected “Pending” status change. But I have no idea how to further troubleshoot what could be causing that delay during the live migration.
Any suggestions are much appreciated.
You said two host NICs are dedicated to iSCSI for the guests: are they connected to two separate vSwitches, with each VM connected to both switches, or are both NICs teamed and connected to a single vSwitch?
And do all the NICs and switch ports in the storage path use jumbo frames?
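If you want to confirm jumbo frames are working end to end, a quick sketch from an elevated command prompt on a host (the target IP here is a placeholder; substitute your EqualLogic group address):

```
rem Show the configured MTU on each interface (~9014 on jumbo-frame NICs)
netsh interface ipv4 show subinterfaces

rem Verify jumbo frames end to end: -f sets Don't Fragment, -l sets payload size.
rem 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers.
rem If this fails while a plain ping succeeds, something in the path
rem (NIC, vSwitch, physical switch port, or array) is not passing jumbo frames.
ping -f -l 8972 192.168.50.10
```

Check this on every host and inside the guests that run iSCSI, since a single mismatched port can cause intermittent stalls.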
FWIW, I run clustered guests on EqualLogic storage without problems. I do, however, recall encountering short hangs during live migration back when the configuration was first validated.
It turned out to be a glitch in the VMQ support of the (then Intel) network drivers, which sometimes delayed the migration process. Disabling VMQ on either the VM or the NIC cleared the issue. In the end, a driver update fixed it as well, so we could re-enable it.
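For reference, a hedged sketch of disabling VMQ on the host NIC on 2008 R2. VMQ maps to the standardized NDIS advanced-property keyword `*VMQ` in the registry; the `0007` instance number below is purely an example, so locate the subkey for your physical NIC first (or simply untick VMQ on the adapter’s Advanced tab in Device Manager, which writes the same value):

```
rem ASSUMPTION: 0007 is the subkey for the Hyper-V-facing NIC under the
rem network adapter class GUID -- verify by checking DriverDesc in each subkey.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0007" /v *VMQ /t REG_SZ /d 0 /f

rem Disable/re-enable the adapter (or reboot) for the change to take effect.
```

On the guest side, VMQ can be turned off per VM under the network adapter’s hardware acceleration settings in Hyper-V Manager, which is an easy way to test whether VMQ is the culprit before touching host drivers.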
Now, by default, Windows clusters are very aggressive about detecting failure. IIRC, 5 seconds without a heartbeat from another node is enough to evict it from the cluster. You could change that to a slightly more relaxed configuration as described here: blogs.msdn.com/.../10370765.aspx
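As a sketch, the relaxed settings can be applied with cluster.exe from an elevated prompt on any node (defaults on 2008 R2 are a 1000 ms heartbeat interval and 5 missed heartbeats; the values below are examples, tune to taste):

```
rem Double the heartbeat interval and the number of tolerated misses
cluster /prop SameSubnetDelay=2000
cluster /prop SameSubnetThreshold=10

rem Equivalent via the FailoverClusters PowerShell module:
rem   Import-Module FailoverClusters
rem   (Get-Cluster).SameSubnetDelay = 2000
rem   (Get-Cluster).SameSubnetThreshold = 10

rem Verify the current values
cluster /prop
```

Keep in mind this only raises the tolerance for missed heartbeats; it doesn’t fix whatever is stalling traffic during the migration.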
I'd suggest figuring out the root cause of the delay, but relaxing the cluster timeouts should be fine for as long as the investigation takes.