47 Posts

April 1st, 2013 15:00

Are those guests that you see the issue with using iSCSI inside the VM, or is it a mix?

30 Posts

April 2nd, 2013 06:00

Yes, the guests are themselves connected to the SAN using iSCSI with the Dell EqualLogic HIT Kit for mpio, etc.

47 Posts

April 2nd, 2013 12:00

How is the load on the live migration network during a live migration? And how is the Power profile configured on the hosts?

30 Posts

April 2nd, 2013 13:00

It's a dedicated Live Migration network, usually at 90% or higher utilization during a live migration.  All HyperV hosts have their power plan set to High Performance.
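For anyone following along, the active power plan can be checked and set from an elevated prompt. This is a sketch, assuming the built-in High Performance scheme is the target (SCHEME_MIN is its documented alias):

```powershell
# List available power schemes; the active one is marked with an asterisk
powercfg /list

# Activate the built-in High Performance scheme (SCHEME_MIN is its alias)
powercfg /setactive SCHEME_MIN
```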

April 2nd, 2013 14:00

You said 2 host NICs are dedicated to iSCSI for guests... are they connected to two separate vSwitches and each VM connected to the two switches or are both NICs teamed and connected to a single vSwitch?

And do all the NICs and switch ports in the storage path use Jumbo Frames?

30 Posts

April 3rd, 2013 06:00

Both iSCSI NICs (not teamed) are connected to a single vSwitch.

Yes, all NICs and switch ports in the storage path use Jumbo Frames.
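One way to double-check that jumbo frames actually work end to end (not just that they're configured) is a don't-fragment ping sized to fill a 9000-byte MTU. A sketch, assuming Server 2012's NetAdapter cmdlets are available; the adapter name pattern and SAN IP below are placeholders:

```powershell
# Show the configured Jumbo Packet value on the iSCSI NICs (hypothetical name pattern)
Get-NetAdapterAdvancedProperty -Name "iSCSI*" -DisplayName "Jumbo Packet"

# 8972 bytes of ICMP payload + 28 bytes of IP/ICMP headers = 9000 bytes on the wire,
# with the Don't Fragment flag set. Replace 10.0.0.10 with a SAN group IP.
ping -f -l 8972 10.0.0.10
```

If any hop in the storage path lacks jumbo support, the ping fails with a "Packet needs to be fragmented" error rather than silently fragmenting.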

30 Posts

April 9th, 2013 12:00

Sounds like something similar, though it does not involve Live Migration or HyperV:

en.community.dell.com/.../19480319.aspx

April 19th, 2013 02:00

FWIW, I use clustered guests with Equallogic storage without problems. I do, however, recall encountering short hangs when live migrating back when the configuration was validated.

It turned out to be a glitch with VMQ support of the (then Intel) network drivers. That sometimes delayed the migration process. Disabling VMQ support on either the VM or the NIC cleared the issue. In the end, a driver update fixed it as well so we could return to using it.
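For reference, both workarounds described above can be applied with the Server 2012 cmdlets. A sketch; the VM and adapter names are hypothetical:

```powershell
# Per-VM: set the VMQ weight to 0 so this vNIC is never assigned a queue
Set-VMNetworkAdapter -VMName "FileServer01" -VmqWeight 0

# Per-NIC: disable VMQ on the physical adapter entirely
Disable-NetAdapterVmq -Name "vSwitch-Uplink"
```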

Now, by default Windows clusters are very aggressive when it comes to detecting failure. IIRC 5 seconds of no heartbeat from another node is enough to kick it out of the cluster. You could change that to a slightly more relaxed configuration as described here: blogs.msdn.com/.../10370765.aspx
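The heartbeat tolerance mentioned above is the product of two cluster common properties, SameSubnetDelay (interval between heartbeats, default 1000 ms) and SameSubnetThreshold (missed heartbeats before a node is considered down, default 5). A sketch of inspecting and relaxing them:

```powershell
# Show current heartbeat settings (defaults: 1000 ms delay x threshold 5 = ~5 s tolerance)
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold

# Relax same-subnet tolerance to ~10 s while the root cause is investigated
(Get-Cluster).SameSubnetThreshold = 10
```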

I'd suggest figuring out the root cause for the delay, but tuning the cluster timeouts should be ok for as long as investigations take.

47 Posts

April 22nd, 2013 15:00

Do the ports used for Hyper-V have STP enabled? If so, try disabling it.

30 Posts

April 23rd, 2013 10:00

Thanks Mirko, I appreciate this information.  Nearly all networks on the HyperV hosts are on Intel cards, and I have verified that they all have VMQ disabled.  I believe the driver update package from Dell is what disabled it, as I do not recall ever adjusting this setting manually.

I am using the embedded Broadcom adapters, but just for host management connectivity and the HyperV Failover Cluster heartbeat network.  In my past experience, Broadcom adapters have been nothing but a disaster, and I avoid them whenever possible.  In any event, I did find that VMQ was enabled on the Broadcom adapters.

Then I found this post [1], which is a close match to my scenario.  I disabled VMQ on all Broadcom adapters via Device Manager and have been stress testing Live Migrations endlessly.  The issue still occurs, although I can now reproduce it much less frequently.  This suggests that disabling VMQ (on adapters having nothing to do with VM traffic or live migration, mind you) has made a modest improvement.  Yet something still prevents a smooth migration 100% of the time.

[1] www.aidanfinn.com
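A quick way to audit VMQ state across every adapter in a host, Intel and Broadcom alike, rather than clicking through Device Manager per NIC (a sketch, assuming the Server 2012 NetAdapter module):

```powershell
# One line per physical adapter: name, driver description, and whether VMQ is on
Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled -AutoSize
```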

I agree that increasing the cluster timeout will only mask the problem.  How did you discover your issue was related to VMQ?  Do you have any other suggestions?  I'm trying to have Dell escalate this case to Microsoft, but Microsoft is insisting that I jump through hoops of fire to prove that I've tried everything in the known universe before they will accept the case.  I will not do this, since it goes against the most basic systems administration practice to properly manage change control and not change 10 things at once hoping something sticks.  So for the moment, the case is in limbo.

30 Posts

April 23rd, 2013 10:00

Hypervfan, we don't manage the network layer here; it's a separate group.  But if I had to guess, I'd say STP is enabled on those ports.  I'm not an expert with that, but if STP were at play, wouldn't 1) the issue occur every time, and 2) STP block the port for a longer period, longer than the 5 - 10 second live migration black-out that I experience?  Thanks for your input.

April 24th, 2013 05:00

I've had catastrophic incidents with both Intel and Broadcom cards. I believe both provide good hardware but have lousy quality control when it comes to drivers.

In some older clusters I'm still using 1Gig Intel ET dual-port adapters, and we had nothing but blue screens and mysterious problems when we started out. Potentially interesting to you: I also started with the Dell-provided driver bundles, but moved to official Intel drivers to fix all the issues. I had a lengthy support case with Dell regarding these cards, and we determined that the problems only occurred with the Dell-supplied drivers. I was assured that using Intel's drivers would not be an issue in future cases with Dell, which has proven true a few times.

The Dell system bundle v410 for R910s was current back then and had the faulty drivers; Intel drivers v16.8.1 were current and worked fine. Of course, all my problems resurfaced with Intel's early 17.x line (hence I say lousy quality control), and were fixed (again) in a later release.

As to how we discovered VMQ was causing the problem: The problem of having a few seconds outage during live migration only affected VMs that were configured to use VMQ (file server cluster nodes where we expected to see much inbound traffic). Having VMQ and VLAN tagging enabled at the same time could also crash the host upon VM startup. Fun times.

30 Posts

April 24th, 2013 09:00

Interesting.  Dell has stressed that I update all drivers and only use the updates posted on the Dell support site.  Although the Intel driver update package posted on their site has a fairly recent date (late 2012 or early 2013), it does not actually update the driver itself, though it does update the PROSet utility.  The readme indicates the driver stays at the same version I already have installed, which dates from June 2012.  I do see that the Intel site has a driver update from just last month, but I have hesitated to try it given the advice above.

Also, what are your settings for RSS, offloading, flow control regarding the following networks:

- live migration network

- csv network

- virtual switch

I can't seem to find consistent documentation about these.  I've disabled some of them on some networks for specific reasons, but I wonder if they should all be disabled across all networks.  Disabling VMQ seems to have improved things to some extent.  This morning I'm testing disabling flow control on the virtual switch adapter, and early results suggest this may offer an improvement as well.
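For comparing settings across the live migration, CSV, and virtual switch networks, the per-adapter state can be queried and changed in one place. A sketch, assuming the Server 2012 NetAdapter cmdlets; the uplink name is a placeholder:

```powershell
# Current RSS state per adapter
Get-NetAdapterRss | Format-Table Name, Enabled -AutoSize

# Flow Control is exposed as an advanced property by most drivers
Get-NetAdapterAdvancedProperty -DisplayName "Flow Control"

# Disable flow control on the virtual-switch uplink (hypothetical adapter name)
Set-NetAdapterAdvancedProperty -Name "vSwitch-Uplink" -DisplayName "Flow Control" -DisplayValue "Disabled"

# TCP Chimney offload is a global TCP setting rather than per-adapter
netsh int tcp set global chimney=disabled
```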

April 24th, 2013 10:00

On the remaining 1Gb clusters' non-iSCSI interfaces I have RSS enabled on all networks, with flow control and chimney offloading off. But I don't think these settings could cause or solve your issues.

Interesting that your support rep would advise against using the Intel driver kit; AFAIK there is no OEM'd network hardware besides what's on the motherboards. I'd try the Intel driver route anyway.

30 Posts

April 26th, 2013 09:00

I just finished installing the latest driver directly from Intel, from 3/2013, and also updated the Broadcom drivers.  I can still reproduce the issue, although it seems less frequent than when I originally discovered it.  I'm still working to get Dell to escalate this issue to Microsoft.

I have to wonder if a 3-ping loss at the end of a live migration would be considered normal and acceptable according to Microsoft.  If so, this means any live migration of a guest cluster node runs the risk of disrupting that guest cluster.  This would be really unfortunate if it turns out to be the case.
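To put a number on the black-out instead of eyeballing ping output, a simple probe loop can time-stamp each lost second during a migration. A rough sketch run from a machine outside the cluster; the guest IP is hypothetical:

```powershell
# Log every failed 1-second probe against a migrating guest (hypothetical IP)
$target = "192.168.1.50"
while ($true) {
    if (-not (Test-Connection -ComputerName $target -Count 1 -Quiet)) {
        "{0:HH:mm:ss.fff}  lost" -f (Get-Date)
    }
    Start-Sleep -Milliseconds 1000
}
```

Three consecutive "lost" lines would correspond to the 3-ping loss described above.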
