April 1st, 2013 13:00

Live Migration causing Failover Cluster connectivity issues

Hi,

 

I’m noticing some unexpected network behavior on some of my Hyper-V guests after performing live migrations.

 

The Hyper-V environment:

 

5-node Windows Server 2008 R2 SP1 Hyper-V Failover Cluster, hosting 14 guests

Each host server has the following network configuration, each NIC on a separate subnet:

  • 1 NIC for dedicated host management
  • 1 NIC for the Cluster Shared Volume network
  • 1 NIC for the cluster heartbeat
  • 1 NIC for the Live Migration network
  • 2 NICs for host iSCSI, using the Dell EqualLogic HIT Kit for MPIO, etc.
  • 1 NIC for the Hyper-V switch
  • 2 NICs for iSCSI within the Hyper-V guests

 

I can reproduce the following behavior fairly consistently, although it does not happen every time:

 

  • 2 guests are themselves clustered using Microsoft Failover Clustering to provide an HA file server; each typically resides on a separate Hyper-V host.
  • When I Live Migrate either of those guests, the migration completes, but after it reaches 99% it changes to a status of “Pending”, where it seems to hang for roughly 4 to 5 seconds before changing back to “Online” and finishing.
  • During this time, a continuous ping of the guest being migrated consistently drops 3 pings while it sits in the “Pending” status (a rough sketch of such a ping loop follows this list).
  • When viewing the System Event log on the guest that was just migrated, I immediately begin receiving the following errors:
    • Event ID 7 errors from the iSCSIPrt source stating “The initiator could not send an iSCSI PDU. Error status is given in the dump data.”
    • Event ID 1135 errors from FailoverClustering stating that the other cluster node (the one not live migrated) was removed from the cluster.  Likewise, the event log on the non-migrated node reports that the migrated node was removed from the cluster.  After roughly 30 to 60 seconds, the cluster reports the migrated node as available again.  Note that I never lose RDP connectivity to the migrated node even though the cluster manager reports it as down.
  • When a Live Migration does not exhibit this behavior, the status change from “Pending” to “Online” happens nearly instantly, within 1 second, and it only ever drops exactly 1 ping, no more, no less.
  • The issue is not specific to clustered guests, because I receive the same Event ID 7 errors on non-clustered standalone guests after a Live Migration.  The guest cluster just makes the issue more visible.
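
For what it's worth, the ping drop is easy to see with a timestamped loop run from another machine on the network; a minimal PowerShell sketch (the guest name FS-NODE1 is only a placeholder):

    # Log a timestamped ping result roughly once a second so the blackout window
    # during the "Pending" status stands out in the output.
    while ($true) {
        if (Test-Connection -ComputerName "FS-NODE1" -Count 1 -Quiet) { $status = "reply" }
        else { $status = "TIMEOUT" }
        "{0:HH:mm:ss.fff}  {1}" -f (Get-Date), $status
        Start-Sleep -Seconds 1
    }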

 

Things I’ve investigated:

 

  • Disabled TOE, RSS, etc. on all Hyper-V hosts and guests; all environments report the following (a sketch of the matching netsh commands follows this list):

      netsh int tcp show global > c:\text.txt
      Querying active state...

      TCP Global Parameters
      ----------------------------------------------
      Receive-Side Scaling State          : disabled
      Chimney Offload State               : disabled
      NetDMA State                        : enabled
      Direct Cache Access (DCA)           : disabled
      Receive Window Auto-Tuning Level    : disabled
      Add-On Congestion Control Provider  : ctcp
      ECN Capability                      : disabled
      RFC 1323 Timestamps                 : disabled

  • Disabled similar settings on each individual network adapter on all Hyper-V hosts and guests
  • Investigated and applied all relevant hotfixes for Hyper-V and Failover Clustering
  • Verified the network binding order
  • Verified network prioritization for the Hyper-V Failover Cluster and confirmed the guests are using the proper network for Live Migration (a quick check is also sketched after this list)
  • Tested disabling firewalls at both the host and guest level
  • The behavior is not isolated to migrations to or from any particular Hyper-V host
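
For completeness, here is roughly what the above amounts to on each box (a sketch, not a transcript).  The netsh lines below produce the global state shown earlier; run them from an elevated prompt (NetDMA was left enabled):

    netsh int tcp set global rss=disabled
    netsh int tcp set global chimney=disabled
    netsh int tcp set global dca=disabled
    netsh int tcp set global autotuninglevel=disabled
    netsh int tcp set global ecncapability=disabled
    netsh int tcp set global timestamps=disabled

The cluster network prioritization can be reviewed with the FailoverClusters PowerShell module (the lowest Metric is the preferred internal/CSV network; the Live Migration network preference itself is set on the VM resource in Failover Cluster Manager):

    Import-Module FailoverClusters
    # Show how the cluster has ordered its networks.
    Get-ClusterNetwork | Format-Table Name, Role, AutoMetric, Metric -AutoSize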

 

It seems the iSCSIPrt errors and Failover Clustering errors merely result from the longer-than-expected “Pending” status change.  But I have no idea how to further troubleshoot what could be causing that behavior during the Live Migration.

 

Any suggestions are much appreciated.

 

Thanks,

Ryan


May 3rd, 2013 10:00

Mirko, your last comment got me thinking in other directions; first stop, antivirus.

We use Symantec Endpoint Protection and already have the recommended exclusions configured.  In verifying this, I found this post [1] and noted the comment asking whether the exclusions apply to real-time or scheduled scans.  I realized that Symantec doesn't indicate whether the exclusions we've applied are for real-time scans, scheduled scans, or both, and I could find no way to ensure the exclusions apply to real-time scanning.

[1] social.technet.microsoft.com/.../2179.hyper-v-anti-virus-exclusions-for-hyper-v-hosts.aspx
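
For reference, these are the kinds of locations that article calls out.  The paths below are the stock defaults and are assumptions on my part, so adjust for custom VM/VHD directories and the actual CSV mount point; the PowerShell snippet just checks that each path exists on a host:

    # Typical Hyper-V exclusion targets per the TechNet article above; defaults only.
    $exclusions = @(
        "C:\ProgramData\Microsoft\Windows\Hyper-V",              # VM configuration files
        "C:\Users\Public\Documents\Hyper-V\Virtual Hard Disks",  # default VHD location
        "C:\ClusterStorage",                                     # CSV root on clustered hosts
        "$env:SystemRoot\System32\Vmms.exe",                     # VM management service
        "$env:SystemRoot\System32\Vmwp.exe"                      # VM worker process
    )
    $exclusions | ForEach-Object { "{0}  (exists: {1})" -f $_, (Test-Path $_) }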

My working theory is that as the live migration completes and a new Hyper-V process starts up on the destination host, if real-time scanning kicks in for even a second or two, that is probably enough to lengthen the blackout, freeze the VM momentarily, and cause a clustered guest to trigger a cluster event.

I've since disabled real-time scanning on my hosts.  I conducted exactly 100 live migration tests of all guests, in various directions across the hosts, and was only able to reproduce the issue 1 time.  I'm willing to ignore that 1 case.  I'm cautiously optimistic that this will resolve the issue, and for the sake of my sanity I'm willing to live with real-time scanning disabled on the hosts.  Thanks for all your input, it's much appreciated.
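
In case it helps anyone repeat the test, the runs can be scripted along these lines with the FailoverClusters module.  The group and node names are placeholders, and I'm assuming -MigrationType Live is available on this build of the module (check Get-Help Move-ClusterVirtualMachineRole if not):

    Import-Module FailoverClusters

    $vmGroup = "FS-NODE1"                 # placeholder: clustered VM group name
    $nodes   = @("HV-HOST1", "HV-HOST2")  # placeholder: Hyper-V host names

    for ($i = 1; $i -le 100; $i++) {
        $target = $nodes[$i % $nodes.Count]    # alternate between the two hosts
        Move-ClusterVirtualMachineRole -Name $vmGroup -Node $target -MigrationType Live
        Start-Sleep -Seconds 60                # settle time between passes
    }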

May 4th, 2013 07:00

Excellent, glad to hear this!

Regarding the virus scanner: I'm not familiar with Symantec Endpoint Protection, but I'd wager it has some form of networking extension/filter to block suspicious traffic and/or behaviour.  Often this uses heuristics to combat zero-day or unknown attacks.

I'd guess that a VM attaching to a network interface and sending a gratuitous ARP to announce that its IP address has moved from a different host to the local machine could look noteworthy from the point of view of antivirus software.

Should it interfere with or slow down the recently migrated VM acquiring its network connectivity, that might indeed explain your symptoms.
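
If you ever want to confirm that from the wire side: the gratuitous ARP is a broadcast, so a capture from any machine on the same subnet during a migration should show the guest's ARP announcement.  A rough sketch using the built-in tracing (the file path is just an example; the resulting .etl opens in Microsoft Network Monitor):

    netsh trace start capture=yes tracefile=C:\Temp\lm-arp.etl
    # ... perform the live migration, then stop the capture ...
    netsh trace stop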
