Unsolved

December 12th, 2009 06:00

Random Reboots with an R710 and Windows 2008 R2

I have a brand new server implementation with an R710 and Windows Server 2008 R2.

The server has been through two unexplained and random reboots so far.

There has been no blue screen of death and no bugchecks.

This is a simple server installation whose only role is Hyper-V v2.

Is anyone else experiencing this issue? All firmware is at the latest.

How does one go about troubleshooting a problem such as this?

Of course, since Dell OpenManage 6.2 isn't out yet I am unable to perform any sort of diagnostics or log examination.

Thanks.

13 Posts

December 12th, 2009 07:00

I just found a possible reason and resolution.

Is anyone else experiencing this same problem?

http://blogs.msdn.com/virtual_pc_guy/archive/2009/10/16/hyper-v-hotfix-for-0x00000101-clock-watchdog-timeout-on-nehalem-systems.aspx

January 13th, 2010 02:00

We have the same problem: two R710s running Windows 2008 R2 with the Hyper-V role, and they reboot randomly when the server is under heavy load. I will try the hotfix.

13 Posts

January 29th, 2010 07:00

Looks like I spoke too soon.

After disabling the C1E state in the server's BIOS our Dell R710 running Hyper-V R2 on 2008 R2 bluescreened last night.

It was about 7 weeks since the last bluescreen due to this bug.

Is anyone else still seeing bugchecks because of this problem even after disabling C1E in the BIOS?

4 Posts

July 6th, 2010 18:00

I am also experiencing this exact same issue. We have two identically configured PE R710 servers. Only one is exhibiting the signs of this issue while the other is not. Dell Tech Support confirmed we should disable C1E in the BIOS. The only noticeable change is that, with C1E disabled, the server just freezes completely, including the DRAC console view, and the only way to bring it back is a full manual power cycle (in our case via DRAC). With the C1E state left enabled, the server still crashes, but it blue screens and automatically reboots itself, which is much more useful than a hard hang.

http://www.mbccs.com

347 Posts

July 7th, 2010 09:00

If time permits, my first test would be to swap all the memory between the systems and see if the problem follows the memory. You could also swap the hard drives between systems (be careful when importing the disks between systems and, if possible, back up the systems first).

22 Posts

July 9th, 2010 07:00

For the R710 system reboots: I'm not sure what type of load is on the server, but try hotfix KB975530 and see whether the behavior persists.

For the processor C1E state: try hotfix KB974090.

Thanks

13 Posts

July 9th, 2010 07:00

It turns out that I had to end up disabling all of the C-states in the BIOS.

Even after disabling C1E I was still getting BSODs.

347 Posts

July 9th, 2010 08:00

I am being told that it's one or the other: disabling C-states and installing the hotfix together shouldn't be necessary. Are you seeing otherwise?

13 Posts

July 9th, 2010 08:00

REYBEAST1 brings up a good point.

In addition to disabling C1E and all of the other C-states, make sure to install this KB hotfix.

After doing the above my BSODs disappeared altogether.

347 Posts

July 9th, 2010 08:00

This is a better explanation of the issue and possible workarounds:
http://support.microsoft.com/kb/975530

13 Posts

July 9th, 2010 09:00

You are essentially correct insofar as I have not tested that configuration.

I would do the following steps in this order:

1) Disable C1E.

2) Install Hotfix.

3) Disable all C States.

Although I've done all of the above I would go in order of 1, 2 and then do 3 if 1 & 2 are not fixing the problem.
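
Before working through that order, it may help to confirm whether the hotfixes named in this thread (KB975530 and KB974090) are already installed. The following is only a rough sketch: it assumes the listing comes from `wmic qfe get HotFixID` run on the server, and the function names `installed_hotfixes` and `missing_hotfixes` are made up for illustration. Hotfixes that were installed by some other mechanism may not appear in that listing, so treat a "missing" result as a hint rather than proof.

```python
import re
import subprocess
import sys

# KB numbers taken from this thread; adjust as needed.
REQUIRED_HOTFIXES = ("KB975530", "KB974090")


def installed_hotfixes(listing):
    """Return the set of KB identifiers found in a text listing of hotfixes."""
    found = re.findall(r"\bKB\d+\b", listing, flags=re.IGNORECASE)
    return {kb.upper() for kb in found}


def missing_hotfixes(listing, required=REQUIRED_HOTFIXES):
    """Return the required KBs that do not appear in the listing, in order."""
    present = installed_hotfixes(listing)
    return [kb for kb in required if kb not in present]


if __name__ == "__main__" and sys.platform == "win32":
    # Assumption: wmic is available, as it is on Windows Server 2008 R2.
    listing = subprocess.run(
        ["wmic", "qfe", "get", "HotFixID"],
        capture_output=True, text=True, check=True,
    ).stdout
    print("Missing hotfixes:", missing_hotfixes(listing) or "none")
```

The parsing is kept separate from the `wmic` call so the same check can be run against a listing saved from another machine.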

3 Posts

July 16th, 2010 06:00

There are several issues going on in this thread vs a common issue, imho - some are getting blue screens and some are not (some just reboots) and there is little info to link the issues other than a "reboot", so I'd prefer to be cautious and consider each issue as a separate issue.

1) For Hyper-V, install KB975530 - this fixes issues with Nehalem procs that are seen when C-states are enabled in BIOS (this is not C1E, but all C-states).  If this KB is installed, C-states can remain enabled.

In the Dell BIOS, under the CPU category, there is a "C-State" enable/disable field - this is the field that I am referring to. It looks like there is also a separate field for C1E, but that would likely not affect the watchdog timeout blue screen fixed by KB975530. Either all C-states must be disabled, or the KB must be installed, in which case C-states can remain enabled.

2)  The other KB mentioned in this thread - KB974090 - is associated with C1E, but is not, to my knowledge, associated with any blue screens - it simply improves C1E usage.  That can also be installed, if you wish.

3)  WRT blue screens - without knowing exactly what the blue screen message is, I can't comment - can I get the exact blue screen message?  Since it reboots upon blue screen, if you want to halt the system on blue screen to read the error, you can do the following:

1)   Open a command prompt

2)   Run regedit

3)   Go to HKLM\SYSTEM\CurrentControlSet\Control\CrashControl

a.    Modify the AutoReboot value from 1 to 0

b.    Reboot the system after modifying the registry, then retest the scenario where the failure occurs.

c.    The OS will then blue screen, but remain at the blue screen so we can read the blue screen error message

You then have to reboot the system manually once you have that info. In addition to getting the blue screen info, you can also check whether it wrote a memory.dmp file (located in C:\Windows) - the blue screen will show whether a dump file was written. The dump can also be used to get much more information on the exact root cause.
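
As an alternative to editing the value by hand in regedit, the same change can be scripted. This is a minimal sketch, not a tested procedure for this server: it builds a `reg add` command for the AutoReboot value under the CrashControl key described above, and only runs it on Windows when the script is started with a made-up `--apply` argument. The helper names are hypothetical, and the command needs an elevated (administrator) prompt to succeed.

```python
import subprocess
import sys

# Registry key from the steps above.
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def autoreboot_command(enabled):
    """Build the reg.exe command that sets AutoReboot to 1 (enabled) or 0."""
    return [
        "reg", "add", CRASH_CONTROL_KEY,
        "/v", "AutoReboot",
        "/t", "REG_DWORD",
        "/d", "1" if enabled else "0",
        "/f",  # overwrite the existing value without prompting
    ]


def set_autoreboot(enabled):
    """Run the command; raises CalledProcessError if reg.exe fails."""
    subprocess.run(autoreboot_command(enabled), check=True)


if __name__ == "__main__" and sys.platform == "win32" and "--apply" in sys.argv:
    # Disable automatic reboot so the blue screen stays visible.
    set_autoreboot(False)
    print("AutoReboot set to 0; reboot the server for the change to apply.")
```

Calling `set_autoreboot(True)` afterwards restores the default behavior once the blue screen message has been captured.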
