2 Bronze

Live Migration causing Failover Cluster connectivity issues

Hi,

 

I’m noticing unexpected network behavior on some of my Hyper-V guests after performing live migrations.

 

The Hyper-V environment:

 

5-node Windows Server 2008 R2 SP1 Hyper-V Failover Cluster, hosting 14 guests

Each host server has the following network configuration, each on separate subnets:

1 NIC for dedicated host management

1 NIC for Cluster Shared Volume network

1 NIC for Cluster heartbeat

1 NIC for Live Migration network

2 NICs for host iSCSI using the Dell EqualLogic HIT Kit for MPIO, etc.

1 NIC for Hyper-V switch

2 NICs for iSCSI within Hyper-V guests

 

I can reproduce the following behavior fairly consistently, although it does not happen every time:

 

  • 2 guests are themselves clustered with Microsoft Failover Clustering to provide an HA file server; each typically resides on a separate Hyper-V host.
  • When I Live Migrate either of those guests, the process completes, but after the Live Migration reaches 99%, the guest changes to a status of “Pending” where it seems to hang for several seconds (~4 to 5) before changing back to “Online” and completing the migration.
  • During this time, a continuous ping of the guest being migrated consistently drops 3 pings while it sits in the “Pending” status (I’ve been logging this with the sketch shown after this list).
  • When viewing the System Event log on the guest that was just migrated, I immediately begin receiving the following errors:
    • Event ID 7 errors from the iSCSIPrt source stating “The initiator could not send an iSCSI PDU. Error status is given in the dump data.”
    • Event ID 1135 errors from FailoverClustering stating the other cluster node (not live migrated) was removed from the cluster. Likewise, the event log on the non-migrated node reports that the migrated node was removed from the cluster. After ~30 to 60 seconds, the cluster reports the migrated node as available again. Note that I never lose RDP connectivity to the migrated node even though the cluster manager reports it as down.
  • When a Live Migration does not exhibit this behavior, the change of status from “Pending” to “Online” happens nearly instantly (within 1 second) and it drops exactly 1 ping, no more, no less.
  • The issue is not specific to clustered guests; I receive the same Event ID 7 errors on non-clustered standalone guests after a Live Migration. The cluster just makes the issue more visible.
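
To line the ping gap up with the event log entries, I log timestamped pings against the guest during a migration and pull the relevant events afterwards. A rough PowerShell sketch (the guest name and log path are just placeholders):

    # Timestamped ping log against the guest being migrated (Ctrl+C to stop)
    $guest = 'FILESERVER1'                                   # placeholder guest name
    while ($true) {
        $ok = Test-Connection -ComputerName $guest -Count 1 -Quiet
        $line = '{0:HH:mm:ss.fff}  {1}' -f (Get-Date), $(if ($ok) { 'reply' } else { 'TIMEOUT' })
        $line
        Add-Content -Path C:\temp\pinglog.txt -Value $line   # assumes C:\temp exists
        Start-Sleep -Milliseconds 500
    }

    # Afterwards, pull Event ID 7 (iSCSIPrt) and 1135 (FailoverClustering) from the guest's System log
    Get-WinEvent -ComputerName $guest -FilterHashtable @{ LogName = 'System'; Id = 7, 1135 } |
        Select-Object TimeCreated, ProviderName, Id |
        Format-Table -AutoSize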

 

Things I’ve investigated:

 

  • Disabled TOE, RSS, etc. on all Hyper-V hosts and guests; all environments report the following:

        netsh int tcp show global > c:\text.txt

        Querying active state...

        TCP Global Parameters
        ----------------------------------------------
        Receive-Side Scaling State          : disabled
        Chimney Offload State               : disabled
        NetDMA State                        : enabled
        Direct Cache Acess (DCA)            : disabled
        Receive Window Auto-Tuning Level    : disabled
        Add-On Congestion Control Provider  : ctcp
        ECN Capability                      : disabled
        RFC 1323 Timestamps                 : disabled

  • Disabled similar settings on each individual network adapter on all Hyper-V hosts and guests
  • Investigated and applied all relevant hotfixes for Hyper-V and Failover Clustering
  • Verified the network binding order
  • Verified network prioritization for the Hyper-V Failover Cluster and confirmed the guests are using the proper network for Live Migration (see the verification sketch after this list)
  • Tested disabling firewalls at the host and guest level
  • The behavior is not isolated to migrations to or from any particular Hyper-V host
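
For reference, this is roughly how those settings are being applied and checked on each box; a sketch using the built-in netsh and the FailoverClusters PowerShell module (nothing exotic):

    # Global TCP offload/tuning settings on every host and guest (matches the output above)
    netsh int tcp set global chimney=disabled rss=disabled autotuninglevel=disabled
    netsh int tcp show global

    # Cluster network roles and metrics on the Hyper-V host cluster
    # (the lowest-metric cluster-enabled network is preferred for cluster/CSV traffic)
    Import-Module FailoverClusters
    Get-ClusterNetwork | Format-Table Name, Role, Metric, AutoMetric -AutoSize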

 

It seems the iSCSIPrt and Failover Clustering errors merely result from the longer-than-expected “Pending” status change, but I have no idea how to further troubleshoot what could be causing that behavior during the Live Migration.

 

Any suggestions are much appreciated.

 

Thanks,

Ryan

3 Argentum

Are the guests you're seeing the issue with using iSCSI inside the VM, or is it a mix?

Dell TechCenter Rockstar

2 Bronze

Yes, the guests are themselves connected to the SAN using iSCSI with the Dell EqualLogic HIT Kit for MPIO, etc.

3 Argentum

How is the load on the live migration network during a live migration? And how is the Power profile configured on the hosts?

Dell TechCenter Rockstar

2 Bronze

It's a dedicated Live Migration network that usually runs at 90%+ utilization during a live migration.  All Hyper-V hosts have their power plan set to High Performance.
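
For what it's worth, the active plan can be confirmed on each host with:

    # Should report the High Performance scheme on every Hyper-V host
    powercfg /getactivescheme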

2 Bronze

You said 2 host NICs are dedicated to iSCSI for guests... are they connected to two separate vSwitches with each VM connected to both, or are both NICs teamed and connected to a single vSwitch?

And do all the NICs and switch ports in the storage path use Jumbo Frames?

2 Bronze

Both iSCSI NICs (not teamed) are connected to a single vSwitch.

Yes, all NICs and switch ports in the storage path use Jumbo Frames.
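
For what it's worth, a quick way to sanity-check the end-to-end jumbo path (assuming a 9000-byte MTU; substitute the real SAN group IP) is a don't-fragment ping from both the hosts and the guests:

    # 8972 = 9000 minus 28 bytes of IP/ICMP headers; -f sets Don't Fragment
    ping -f -l 8972 <SAN group IP>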

2 Bronze

This sounds like something similar, though it does not involve Live Migration or Hyper-V:

en.community.dell.com/.../19480319.aspx


FWIW, I use clustered guests with EqualLogic storage without problems. I do, however, recall encountering short hangs during live migration back when the configuration was first validated.

It turned out to be a glitch with VMQ support of the (then Intel) network drivers. That sometimes delayed the migration process. Disabling VMQ support on either the VM or the NIC cleared the issue. In the end, a driver update fixed it as well so we could return to using it.
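
If it helps, on 2008 R2 the per-NIC VMQ setting is exposed as the adapter's *VMQ advanced property; a rough registry sketch to list it for the physical NICs (standard network adapter class GUID; 0 = disabled, 1 = enabled):

    $class = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}'
    Get-ChildItem $class -ErrorAction SilentlyContinue |
        Get-ItemProperty -ErrorAction SilentlyContinue |
        Where-Object { $_.'*VMQ' -ne $null } |
        Select-Object DriverDesc, @{ Name = 'VMQ'; Expression = { $_.'*VMQ' } }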

Now, by default Windows clusters are very aggressive when it comes to detecting failure. IIRC, 5 seconds without a heartbeat from another node is enough to kick it out of the cluster. You could change that to a slightly more relaxed configuration, as described here: blogs.msdn.com/.../10370765.aspx

I'd suggest figuring out the root cause of the delay, but tuning the cluster timeouts should be OK for as long as the investigation takes.
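
A sketch of what that looks like via PowerShell on one of the guest cluster nodes (the 2000 ms / 10 values are just an example of a more relaxed window, not a recommendation; 2008 R2 defaults are 1000 ms and 5):

    Import-Module FailoverClusters

    # Current heartbeat settings: delay = ms between heartbeats, threshold = missed heartbeats tolerated
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

    # Relax same-subnet failure detection to ~20 seconds (10 heartbeats at 2-second intervals)
    (Get-Cluster).SameSubnetDelay = 2000
    (Get-Cluster).SameSubnetThreshold = 10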

3 Argentum

Do the ports used for Hyper-V have STP enabled? If so, try disabling it.

Dell TechCenter Rockstar
