47 Posts

April 1st, 2013 15:00

Are those guests that you see the issue with using iSCSI inside the VM, or is it a mix?

30 Posts

April 2nd, 2013 06:00

Yes, the guests are themselves connected to the SAN using iSCSI with the Dell EqualLogic HIT Kit for mpio, etc.

47 Posts

April 2nd, 2013 12:00

How is the load on the live migration network during a live migration? And how is the Power profile configured on the hosts?

30 Posts

April 2nd, 2013 13:00

It's a dedicated Live Migration network, usually at 90% or higher utilization during a live migration.  All HyperV hosts have their power plan set to High Performance.
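For anyone following along, the active power plan can be checked and set from an elevated prompt. This is a sketch, assuming the built-in High Performance scheme is the target (SCHEME_MIN is its documented alias):

```powershell
# List available power schemes; the active one is marked with an asterisk
powercfg /list

# Activate the built-in High Performance scheme (SCHEME_MIN is its alias)
powercfg /setactive SCHEME_MIN
```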

April 2nd, 2013 14:00

You said 2 host NICs are dedicated to iSCSI for guests... are they connected to two separate vSwitches and each VM connected to the two switches or are both NICs teamed and connected to a single vSwitch?

And do all the NICs and switch ports in the storage path use Jumbo Frames?

30 Posts

April 3rd, 2013 06:00

Both iSCSI NICs (not teamed) are connected to a single vSwitch.

Yes, all NICs and switch ports in the storage path use Jumbo Frames.
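One way to double-check that jumbo frames actually work end to end (not just that they're configured) is a don't-fragment ping sized to fill a 9000-byte MTU. A sketch, assuming Server 2012's NetAdapter cmdlets are available; the adapter name pattern and SAN IP below are placeholders:

```powershell
# Show the configured Jumbo Packet value on the iSCSI NICs (hypothetical name pattern)
Get-NetAdapterAdvancedProperty -Name "iSCSI*" -DisplayName "Jumbo Packet"

# 8972 bytes of ICMP payload + 28 bytes of IP/ICMP headers = 9000 bytes on the wire,
# with the Don't Fragment flag set. Replace 10.0.0.10 with a SAN group IP.
ping -f -l 8972 10.0.0.10
```

If any hop in the storage path lacks jumbo support, the ping fails with a "Packet needs to be fragmented" error rather than silently fragmenting.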

30 Posts

April 9th, 2013 12:00

Sounds like something similar, though it does not involve Live Migration or HyperV:

en.community.dell.com/.../19480319.aspx

April 19th, 2013 02:00

FWIW, I use clustered guests with Equallogic storage without problems. I do, however, recall encountering short hangs when live migrating back when the configuration was validated.

It turned out to be a glitch with VMQ support of the (then Intel) network drivers. That sometimes delayed the migration process. Disabling VMQ support on either the VM or the NIC cleared the issue. In the end, a driver update fixed it as well so we could return to using it.
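For reference, both workarounds described above can be applied with the Server 2012 cmdlets. A sketch; the VM and adapter names are hypothetical:

```powershell
# Per-VM: set the VMQ weight to 0 so this vNIC is never assigned a queue
Set-VMNetworkAdapter -VMName "FileServer01" -VmqWeight 0

# Per-NIC: disable VMQ on the physical adapter entirely
Disable-NetAdapterVmq -Name "vSwitch-Uplink"
```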

Now, by default Windows clusters are very aggressive when it comes to detecting failure. IIRC 5 seconds of no heartbeat from another node is enough to kick it out of the cluster. You could change that to a slightly more relaxed configuration as described here: blogs.msdn.com/.../10370765.aspx
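The heartbeat tolerance mentioned above is the product of two cluster common properties, SameSubnetDelay (interval between heartbeats, default 1000 ms) and SameSubnetThreshold (missed heartbeats before a node is considered down, default 5). A sketch of inspecting and relaxing them:

```powershell
# Show current heartbeat settings (defaults: 1000 ms delay x threshold 5 = ~5 s tolerance)
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold

# Relax same-subnet tolerance to ~10 s while the root cause is investigated
(Get-Cluster).SameSubnetThreshold = 10
```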

I'd suggest figuring out the root cause for the delay, but tuning the cluster timeouts should be ok for as long as investigations take.

47 Posts

April 22nd, 2013 15:00

Do the ports used for Hyper-V have STP enabled? If so, try disabling it.

30 Posts

April 23rd, 2013 10:00

Thanks Mirko, I appreciate this information.  Nearly all networks on the HyperV hosts are on Intel cards, and I have verified that they all have VMQ disabled.  I believe the driver update package from Dell is what disabled it, as I do not recall ever adjusting this setting manually.

I am using the embedded Broadcom adapters, but just for host management connectivity and the HyperV Failover Cluster heartbeat network.  In my past experience, Broadcom adapters have been nothing but a disaster, and I avoid them whenever possible.  In any event, I did find that VMQ was enabled on the Broadcom adapters.

Then I found this post [1], which is a close match to my scenario.  I disabled VMQ on all Broadcom adapters via Device Manager and have been stress testing Live Migrations endlessly.  The issue still occurs, although I can now reproduce it much less frequently.  This suggests that disabling VMQ (on adapters having nothing to do with VM traffic or live migration, mind you) has made a modest improvement.  Yet something still prevents a smooth migration 100% of the time.

[1] www.aidanfinn.com
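A quick way to audit VMQ state across every adapter in a host, Intel and Broadcom alike, rather than clicking through Device Manager per NIC (a sketch, assuming the Server 2012 NetAdapter module):

```powershell
# One line per physical adapter: name, driver description, and whether VMQ is on
Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled -AutoSize
```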

I agree that increasing the cluster timeout will only mask the problem.  How did you discover your issue was related to VMQ?  Do you have any other suggestions?  I'm trying to have Dell escalate this case to Microsoft, but Microsoft is insisting that I jump through hoops of fire to prove that I've tried everything in the known universe before they will accept the case.  I will not do this, since it goes against the most basic systems administration practice to properly manage change control and not change 10 things at once hoping something sticks.  So for the moment, the case is in limbo.

30 Posts

April 23rd, 2013 10:00

Hypervfan, we don't manage the network layer here; it's a separate group.  But if I had to guess, I'd say STP is enabled on those ports.  I'm not an expert with that, but if STP were at play, wouldn't 1) the issue occur every time, and 2) STP block the port for a longer period, longer than the 5 - 10 second live migration black-out that I experience?  Thanks for your input.

April 24th, 2013 05:00

I've had catastrophic incidents with both Intel and Broadcom cards. I believe both provide good hardware but have lousy quality control when it comes to drivers.

In some older clusters I'm still using 1Gig Intel ET dual-port adapters, and we had nothing but blue screens and mysterious problems when we started out. Potentially interesting to you: I also started with the Dell-provided driver bundles, but moved to official Intel drivers to fix all the issues. I had a lengthy support case with Dell regarding these cards, and we determined that the problems only occurred with the Dell-supplied drivers. I was assured that using Intel's drivers would not be an issue in future cases with Dell, which has proven true a few times.

The Dell system bundle v410 for R910s was current back then and had the faulty drivers; Intel drivers v16.8.1 were current and worked fine. Of course, all my problems resurfaced with Intel's early 17.x line (hence I say lousy quality control), and were fixed (again) in a later release.

As to how we discovered VMQ was causing the problem: The problem of having a few seconds outage during live migration only affected VMs that were configured to use VMQ (file server cluster nodes where we expected to see much inbound traffic). Having VMQ and VLAN tagging enabled at the same time could also crash the host upon VM startup. Fun times.

30 Posts

April 24th, 2013 09:00

Interesting.  Dell has stressed that I update all drivers and only use the updates posted on the Dell support site.  Although the Intel driver update package posted on their site has a fairly recent date (late 2012 or early 2013), it does not actually update the driver itself, though it does update the PROSet utility.  The readme indicates the driver stays at the same version I already have installed, which dates from June 2012.  I do see that the Intel site has a driver update from just last month, but I have hesitated to try it given the advice above.

Also, what are your settings for RSS, offloading, flow control regarding the following networks:

- live migration network

- csv network

- virtual switch

I can't seem to find consistent documentation about these.  I've disabled some of them on some networks for specific reasons, but I wonder if they should all be disabled across all networks.  Disabling VMQ seems to have improved things to some extent.  This morning I'm testing disabling flow control on the virtual switch adapter, and early results suggest this may offer an improvement as well.
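For comparing settings across the live migration, CSV, and virtual switch networks, the per-adapter state can be queried and changed in one place. A sketch, assuming the Server 2012 NetAdapter cmdlets; the uplink name is a placeholder:

```powershell
# Current RSS state per adapter
Get-NetAdapterRss | Format-Table Name, Enabled -AutoSize

# Flow Control is exposed as an advanced property by most drivers
Get-NetAdapterAdvancedProperty -DisplayName "Flow Control"

# Disable flow control on the virtual-switch uplink (hypothetical adapter name)
Set-NetAdapterAdvancedProperty -Name "vSwitch-Uplink" -DisplayName "Flow Control" -DisplayValue "Disabled"

# TCP Chimney offload is a global TCP setting rather than per-adapter
netsh int tcp set global chimney=disabled
```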

April 24th, 2013 10:00

On the remaining 1Gb clusters' non-iSCSI interfaces I have RSS enabled on all networks, with flow control and chimney offloading off. But I don't think these settings could cause or solve your issues.

Interesting that your support rep would advise against using the Intel driver kit; AFAIK there is no OEM'd network hardware besides what's on the motherboards. I'd try the Intel driver route anyway.

30 Posts

April 26th, 2013 09:00

I just finished installing the latest driver directly from Intel, from 3/2013, and also updated the Broadcom drivers.  I can still reproduce the issue, although it seems less frequent than when I originally discovered it.  I'm still working to get Dell to escalate this issue to Microsoft.

I have to wonder if a 3-ping loss at the end of a live migration would be considered normal and acceptable according to Microsoft.  If so, this means any live migration of a guest cluster node runs the risk of disrupting that guest cluster.  This would be really unfortunate if it turns out to be the case.
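To put a number on the black-out instead of eyeballing ping output, a simple probe loop can time-stamp each lost second during a migration. A rough sketch run from a machine outside the cluster; the guest IP is hypothetical:

```powershell
# Log every failed 1-second probe against a migrating guest (hypothetical IP)
$target = "192.168.1.50"
while ($true) {
    if (-not (Test-Connection -ComputerName $target -Count 1 -Quiet)) {
        "{0:HH:mm:ss.fff}  lost" -f (Get-Date)
    }
    Start-Sleep -Milliseconds 1000
}
```

Three consecutive "lost" lines would correspond to the 3-ping loss described above.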
