
October 16th, 2013 18:00

PE T320 Hyper-V Guest latency fixed by BIOS update to Host

Twice now, the Exchange 2010 guest on my W2K8 Standard server with Hyper-V has suddenly suffered massive latency and become practically locked up out of the blue.  After hours of troubleshooting, rebooting, and researching, as a last-ditch effort I noticed that the BIOS was out of date.  I upgraded the BIOS and then everything magically worked again.

Anywhere from two to four months pass, and then it's rinse and repeat.

I'm going out on a limb here, but I suspect that something about the system gets flushed out during a BIOS update, and that something is causing the problem; I'm at a loss to determine what it is.  I want to proactively avoid having my customer's Exchange server fail, but it goes against my better judgment to apply BIOS updates proactively.  We've already had the motherboard swapped out due to a bad DRAC, and the problem keeps happening.

I could use some help in identifying what else I can do to prevent this from happening in the future.  TIA

3 Posts

April 15th, 2014 09:00

Well, we haven't seen the issue return at all since we changed the System Profile from "Performance per Watt" to "Performance".  I would start there.  It's been three months and none of our T320s do this anymore.  We're 90% sure this is the solution.

Moderator • 6.2K Posts

October 17th, 2013 14:00

Hello swade2569

A BIOS update should flush cache and memory throughout the system.

  • Are you having issues with the host operating system or any other VMs, or is it limited to this one VM?
  • Are you using any technology that bypasses the hypervisor, like SR-IOV?
  • When the VM locks up, have you noticed whether memory or CPU utilization within the VM or the host OS is maxed out?
  • Are there any errors in the hardware log of the iDRAC/BMC?
  • Any other info you can provide to help narrow down the cause?

Thanks

3 Posts

October 17th, 2013 18:00

The host OS seems to run well enough; I don't see a lot of latency there.  There are three VMs running: a W2K3 server, which has no problems; the W2K8 Exchange server, which runs just awfully when this happens; and a W2K8 DC, which spikes every 3-4 seconds but runs OK.

Not using anything like SR-IOV.

The memory on the Exchange VM is not maxed out.  I have it set to dynamic, and at worst it only uses about 2.5 GB.  The CPU, however, is running at about 90-95% (as indicated in a perfmon session on the host, not the VM's Resource Monitor).

No hardware errors are being registered. 

I suspect cache and memory are only going to be flushed on a BIOS update or complete shutdown, which isn't very helpful since this server is not readily accessible.  We reboot it once a month at minimum but that doesn't seem sufficient.

 

Moderator • 6.2K Posts

October 26th, 2013 17:00

Troubleshooting issues that occur so infrequently is very difficult. I would recommend assigning more CPU capacity to the VM if it is maintaining close to 100% utilization. Also, read through this post; this person had a very similar issue that was corrected by changing the VHD from a dynamically expanding disk to a static one:

http://community.spiceworks.com/topic/318917-hyper-v-unresponsive-for-about-20-30-minutes

Thanks

3 Posts

January 29th, 2014 06:00

We've had the identical problem with ALL of our Dell T320s.  We have 16 of them, so it happens infrequently on any single T320, but on average it comes up at least once every fortnight across the fleet.  Since it happens to us fairly frequently, we have been able to learn a lot about the nature of the problem.  Hopefully what we've learned can help you.  We still haven't solved it, and Dell is completely clueless about the cause, or even a possible reason, for this issue.

  • Reboots don't fix it
  • Electrically disconnecting and discharging the power supplies for over ten minutes solves the symptoms
  • Disconnecting the motherboard from the power supplies solves the symptoms
  • BIOS firmware updates solve the symptoms
  • Power supply firmware updates solve the symptoms

By "solve the symptoms," I mean the sluggish performance stops and the Hyper-V VMs return to normal performance.  However, in a month or two the problem comes back at random.  Unfortunately, every time we apply a BIOS update it's like burning a match, and we're running out of BIOS updates to apply; we're already at 2.0.22 on all of our servers.  The last thing we're in the process of trying is the System Profile: if you press F2 into the BIOS and select System Profile, change the profile from "Performance per Watt" to "Performance".  It solves the symptoms, and so far none of the servers with this setting changed have gone symptomatic on us.  We're trying to do good science and keep a control group and a changed group, but we have only been trying this for about a month.  I'll report back in four months to let you know whether it's an actual fix.
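
If a box isn't easy to get to physically, the same profile change can probably be staged remotely through the iDRAC with racadm instead of F2.  Below is a rough sketch of the idea; it's untested here, the iDRAC address and credentials are placeholders, and the attribute path and value name are assumptions for iDRAC7-era firmware, so verify them with a racadm get before trusting any of it:

    import subprocess

    # Rough, unverified sketch: stage the System Profile change through the
    # iDRAC's racadm CLI instead of pressing F2 at the console.
    # The iDRAC address/credentials are placeholders, and the attribute path and
    # value ("BIOS.SysProfileSettings.SysProfile" / "PerfOptimized") are
    # assumptions -- check them with "racadm get BIOS.SysProfileSettings" first.
    IDRAC = ["racadm", "-r", "10.0.0.120", "-u", "root", "-p", "calvin"]

    def racadm(*args):
        result = subprocess.run(IDRAC + list(args), capture_output=True, text=True)
        print(result.stdout or result.stderr)

    racadm("get", "BIOS.SysProfileSettings.SysProfile")                   # show current profile
    racadm("set", "BIOS.SysProfileSettings.SysProfile", "PerfOptimized")  # stage "Performance"
    # The staged value only takes effect after a BIOS config job and a reboot;
    # see the racadm guide for the exact jobqueue syntax on your firmware.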

3 Posts

April 15th, 2014 08:00

Having similar issues! Have you determined if changing the System Profile from "Performance per Watt" to "Performance" fixes the issue?

3 Posts

April 15th, 2014 12:00

Just thought I would include more specifics about my issue in case others find this thread: Dell PowerEdge T320, Xeon E5 @ 2.4GHz, 16GB memory, 4x600GB 15K SAS in RAID 10, Small Business Server 2011.

So far this system has been hit a few times by the "latency" issue described in this post. Out of nowhere the server seems to grind to a halt, as if the CPU were running at 100MHz instead of 2400MHz. The system still responds to commands and runs programs, just at a snail's pace.

As mentioned in a previous post by LSLC, rebooting the server does NOT fix the issue! However, I can confirm that installing a new BIOS update alleviates the symptoms (currently running 2.1.2). I also followed the recommendation to change the BIOS System Profile from "Performance per Watt" to "Performance". I'll be monitoring this to see if it is a long-term fix (perhaps the latest BIOS update fixes the issue).

3 Posts

June 19th, 2014 10:00

I second that recommendation! BIOS upgrade didn't solve the problem long-term for us. Changing to "Performance" seems to have fixed the issue permanently.

3 Posts

June 19th, 2014 10:00

I absolutely recommend changing the System Profile to "Performance" as your first step in solving the problem.  We have had no further issues with our T320s after doing this.  BIOS upgrades only temporarily fix the problem, as does draining the power supplies.

1 Message

June 19th, 2014 10:00

Hey guys, I'm having the SAME symptoms on a T320 that a customer of ours runs.  We're running Hyper-V Server Core on this with several guest VMs.  When there's a freeze-up, rebooting doesn't fix the problem either; I don't see relief from these symptoms until I go in through the DRAC and do a power off/power on.  I'm wondering if all of you who say the BIOS upgrades fix this are just seeing a coincidence: the power cycle that goes with the update is what frees up the resources.

I'd like to hear if LSLC has seen the problems are still gone after all this time.

1 Message

August 21st, 2014 06:00

I can confirm that switching the System Profile from "Performance per Watt" to "Performance" in the BIOS fixes this problem.  

Our month-old PE T320 (Server 2012 R2) with 2 VMs (both 2012 R2) was performing well; then yesterday morning one of the VMs slowed to a crawl.  It was as if it were running at 100 MHz.  Everything still worked, just excruciatingly slowly.  The host and the other VM were fine.  Rebooting the VM and the whole server did nothing.

There is obviously a problem with the BIOS, and Dell should look at this.  That one setting change in the BIOS fixed it; no other changes to drivers or firmware were needed.

3 Posts

August 21st, 2014 12:00

I can also confirm this issue is resolved by switching to Performance. 

1 Message

October 19th, 2014 19:00

This weekend we hit this problem like a brick wall. Same Dell T320 as everyone else on this thread, this time running a Hyper-V Server 2012 R2 host and a pair of virtual machines: Server 2012 R2 Std and Win7 Pro x86. One of my co-workers spent hours troubleshooting poor performance following the installation of the VMs and concluded it was a bad switch, but once the new switch was installed the problem persisted. I think the info below would help a developer figure out where to look:

The symptom was that with the network idle and unused (a small business with a single DC and 5 workstations), both of the VMs would exhibit 180-200ms ping times. The VM OS did not seem to be a variable, as we saw the exact same behavior from our two VMs: one x64, the other x86; one a server, the other a workstation; one with the 8.1 kernel, the other with the 7 kernel. On both VMs network applications ran terribly, file copy speeds over the network were poor, etc.

I brought a fresh set of eyes to the problem and by accident noticed something very odd. The host's ping times were always normal, under 1ms. However, when the host's physical network interface was pinged, the latency to the VM would drop to 1-2ms for just the split second when the ping packet to the host was being handled. We then used fping to get a better idea of what was happening. Fping, for those who don't know, allows fast pings with many controllable variables and detailed output; you can sling ICMP packets as fast as you'd like and track jitter.

It was hard to explain how this could be happening, but fping confirmed that as soon as the host's NIC was tasked with replying to ICMP pings, the VM's NIC would come back to life and work properly, just for that split second. I was certain this was some network anomaly with our workstation, but through more testing we confirmed that something in the server's NIC hardware and Hyper-V emulation really was making this happen. We could run the ping to the VM from one physical workstation and the ping to the host from another, and the moment you hit Enter, the VM pings went from 200ms to 0.2ms. Later we discovered that it did not matter whether you pinged the host or either of the VMs; as long as you were pushing some traffic to the NIC, other traffic worked better. It was as if the fping stream was keeping the NIC from going into micro-sleeps.

Through trial and error, we found that setting fping to bang away at the host at a 20ms interval was ideal; doing it more slowly would cause occasional 200ms packets, and doing it any faster would actually cause complete packet loss. In fact, if you set it to 1ms, we were able to completely flood the server into 90+% packet loss. (If anyone wants to tank one of these servers, fping is your friend.) Although leaving fping running in the background would be a viable ghetto solution, the network applications, while running much faster, were still not performing up to par.

After confirming that we were not crazy and we really were not looking at a cabling, switching, network overload, broadcast flooding, or other similar problem, I decided to Google it. The terms T320, Hyper-V, and lag yielded a shocking number of results, this thread among them. Unfortunately, none of the fixes above worked for us. Our BIOS was at version 1.X; we went straight to the latest 2.3.3 and the problem was still there. One of the posts above has a typo, so I was not sure which power-saving/performance option we needed to set in the BIOS, so we tried all four; none of them worked. I saw another post where this problem was resolved on a 2012 Server Standard running Hyper-V by opening the NIC properties in Device Manager and selecting Advanced -> Virtual Machine Queues -> Disabled. However, ours being the stripped-down Hyper-V Server, there is no way to open Device Manager to change that kind of thing on the host.
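
(In hindsight, the same VMQ toggle should also be reachable through the in-box NetAdapter PowerShell cmdlets, even on Core, so Device Manager isn't strictly required. A rough, unverified sketch follows; the adapter name is a placeholder, and I can't say whether disabling VMQ actually helps with this particular T320 problem:

    import subprocess

    # Unverified sketch: show VMQ status and disable it on the host NIC from a
    # script, since Hyper-V Server Core has no Device Manager. "NIC1" is a
    # placeholder -- run Get-NetAdapterVmq first and use the real adapter name.
    subprocess.run(["powershell", "-NoProfile", "-Command", "Get-NetAdapterVmq"])
    subprocess.run(["powershell", "-NoProfile", "-Command",
                    'Disable-NetAdapterVmq -Name "NIC1"'])

We never got to test that route ourselves.)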

What finally fixed it for us was a $15 network card installed in one of the motherboard expansion slots. Like flipping a light switch, our ping times dropped to 0.1-0.2ms without a background fping running, network apps load up lightning quick, and everything works normally.

In the end, combining what we discovered with what I read here and in other posts, I think the issue lies somewhere between the tree-hugger power-saving 'features' built into the BIOS of the T320 and the way Hyper-V virtualizes the onboard network card. Something in that loop seems to put the NIC to sleep for a split second and then has to wake it up. If you keep it awake by pinging at 20ms intervals, the problem greatly diminishes; having other background traffic on the network similarly made the problem less noticeable, which explains why network apps ran better while we were running a backup to a NAS and then slowed to a crawl once we cancelled the job. I bet there are tons of these servers out there in deployment having this problem and having it dismissed as a general performance issue.
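
For anyone who wants to try the keep-the-NIC-awake band-aid described above without installing fping, any small, steady trickle of traffic aimed at the host should do the same job. Here's a rough sketch; the address is a placeholder, it uses tiny UDP packets rather than ICMP, and I haven't verified that it behaves identically to the fping run we used:

    import socket
    import time

    # Quick-and-dirty keep-alive: send a tiny UDP packet toward the host every
    # ~20 ms so its NIC never sits idle. The address is a placeholder, and
    # nothing needs to be listening on port 9 (discard) -- the traffic itself
    # is the point.
    HOST = "192.168.1.10"
    PORT = 9
    INTERVAL = 0.020  # seconds; ~20 ms matched the sweet spot found with fping

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(b"x", (HOST, PORT))
        time.sleep(INTERVAL)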

October 29th, 2014 20:00

I am so happy I found this thread. We have deployed a few T320s for clients, and out of the blue they start running TERRIBLY. Some with 2008 R2, some with SBS 2011, some with 2012 R2, and some with VMs. It was always the T320s. It always looked to me like a disk I/O issue, so I would talk Dell into replacing the PERC. That would usually fix it (probably because the system would be powered off and unplugged for a few minutes). I had another one do the same thing; the client complained it was crazy slow. They have a T320 with 32GB, 4x500GB on a PERC H710 +1GB, and Windows 2012 R2 with two 2008 R2 VMs. There was no reason the four people there should be running so badly. Did a BIOS update, changed the setting in OMSA, and rebooted. Night and day. VMs that were PEGGED on CPU are sitting at 1-5%. So much happier. Told the rest of my team about this and am having them check their T320s in the field.

1 Message

December 4th, 2014 14:00

Same issue: Dell T320, Server 2012 R2 Hyper-V with an Exchange 2013 VM.

Out of nowhere the CPU spikes to 100% on Exchange; shutting down the VM does not help, and the host is still laggy.  A soft boot of the host also does not help.

MS suggested BIOS and RAID FW updates; did those and the problem seemed to go away, but 3 weeks later, same thing.  Yet another new BIOS FW, the problem goes away, and 4 weeks later, same thing.

This time all I did was a hard power off and back on, and the problem goes away.  MS says Dell has a fix, so I will be calling tomorrow.  Will let you know.

In the meantime, a weekly hard boot should keep things stable, based on my experience.
