January 6th, 2010 06:00

PowerEdge 2650: "System Halted!"

I have a PE2650 that will occasionally fail during the boot process.  The exact error text reported is "System Halted!"  No additional details are given.  This problem is intermittent - most of the time the machine will boot normally.

When the error occurs I've observed the following:

+ The error consistently occurs all at once.  What I mean by this is I do not get mixed results during a set time period.  The error will occur every time I attempt to boot the machine and once it's gone the machine consistently boots normally.  For example, if I rebooted or powered down 100 times in a row then all 100 would be failures.  I would not get a mixed bag of 60 successes and 40 failures.

+ The keyboard is nonresponsive.

+ The RAC boot path analysis log gets to "POST Code 92 | Initializing ACPI" and then reports a failure thinking the system might be hung (which it is).

+ The "System Halted!" error occurs immediately after the 5 seconds expire when the system is waiting for you to configure the RAC using Ctrl+D.  The RAC works fine so I'm sure this is not caused by the RAC.

+ The RAID controller works fine.  There is no evidence to suggest that the RAID configuration is an issue.

+ The server incorrectly reports having two 2400 MHz processors with a 400 MHz bus speed.  Actually, this server has two 3066 MHz processors with a 533 MHz bus speed.  This behavior is 100% repeatable.  I can reliably predict that the system will hang as soon as it reports the incorrect processor and bus speeds.

+ This error occurs before the BIOS setup (F2) or Utilities (F10) applications can be entered.

When the error does not occur I've observed the following:

+ The OpenManage Server Administrator web site does not report any errors.

+ All memory tests pass.

+ The operating system - Windows Server 2003 R1 SP2 - works great.

While troubleshooting I've noticed that a warm reboot has never fixed the problem.  I have to do a cold restart and completely power off the machine.  However, I think we can rule out an overheating issue because the "System Halted!" error will occur even after the machine has been off for days.  All sensors indicate temperatures are completely normal.

To me it seems the best thing to explore first is why would a system report an incorrect processor and/or bus speed?  Like I stated above, there is a 100% repeatable link between the system being able to boot and it reporting the correct speeds.

The following screenshot shows the beginning of a successful boot since the processor and bus speeds are reported _correctly_ at 3066/533.  A "System Halted!" message would appear if the processor and bus speeds were reported incorrectly as 2400/400:

 

January 6th, 2010 06:00

Can the moderator please move this post to the appropriate forum.   This is not a RAID or storage related issue.

30 Posts

January 6th, 2010 16:00

Are you running the latest system BIOS?  Do the CPUs have matching sSpec numbers?  I'd try reloading BIOS defaults and re-checking.  Then try resetting CMOS via jumper/battery removal.  If you're still having problems I would pull the system down to one CPU and one stick of memory, reset CMOS and test again.  If that doesn't help, it's either a hardware (motherboard) issue or the BIOS design itself.

January 6th, 2010 19:00

Thank you for your response.

Yes, the system is running the most recent version of the BIOS - A21.  I'll try setting it to its defaults.  I don't think it's a BIOS _design_ problem because I have 4 of these machines with the exact same specifications and they have never experienced a problem.

With regard to the problem machine - both CPUs have been verified to have the same processor and bus speeds - so it cannot be a mismatch error.

I did try removing a CPU to make it a 1 processor machine.  The RAM was swapped out as well.  The problem is that this issue is intermittent and difficult to repeat.  Sometimes I've just shut it down and it'll work fine after 5 minutes.

I'm going to attempt to reset the BIOS and CMOS as you suggested - that's a good idea.  You never know what's going on there.

The only thing that I can state with 100% certainty is if the processor and bus speeds are reported incorrectly at power-on then the "System Halted!" error message will occur.  If the processor and bus speeds are reported correctly then the error message does not occur.  Therefore, if we can get the system to identify the CPU/FSB speeds correctly this should solve the problem.

What utilities exist to test the system board?  Does Dell or Intel have a tool to check its integrity?  How do you isolate an issue like this?

I'd also like to hear your thoughts about this being a potential problem of the power supply not functioning correctly.

 

30 Posts

January 7th, 2010 16:00

Did you experience the bad behavior when running one CPU?  Also, the CPUs may be the same model and bus speed yet not have matching sSpec numbers, and there is a chance that may be causing problems during POST.  If you can get into the OS, run CPU-Z, select each CPU separately using the drop-down list, and verify that both CPUs are the same stepping and revision.  It could also be power supply related; that's a good area to check as well.  Dell offers higher-output power supplies for some servers running dual processors, so you could verify with them that the PSU in this particular machine meets their spec.  If someone upgraded the server from its initial configuration of one CPU, for instance, you could encounter power issues.  Upgrading RAM and other hardware adds to power consumption; that's where running the system stripped down can help identify the source.

January 8th, 2010 08:00

Excellent troubleshooting advice.

I did not experience any problems when running exactly one CPU.  However, the problem is intermittent and it has been very difficult to isolate.

I'm glad you mentioned sSpec# again because I don't know what that is.  When I read your post before I just assumed it was a combination of the processor and bus speeds.  On the problem machine CPU-Z reports that both processors have identical stepping (9) and revision (D1) attributes.  The family, ext. family, model, and ext. model all match as well.

Assuming that a BIOS and CMOS reset does not yield results and that a power supply verification does not yield any results either - what would be the next step you would take to troubleshoot this?

 

30 Posts

January 8th, 2010 15:00

At least you're narrowing things down.  The intermittent condition reminds me very much of several desktop boards from ASUS, DFI, Gigabyte and others that experience a similar problem.  These systems will refuse to POST after being powered off for a period of time.  That's where the poor BIOS design comment came from.  The usual procedure on these systems experiencing a 'cold boot' problem is to remove or swap RAM in order to get the system POSTing again.  I have had desktop systems that also mis-ID the CPU during POST.  It can be caused by a lack of BIOS support for a given CPU, running the system out of spec or even changing some common BIOS parameters.  It can also be caused by a bad BIOS flash.  In the retail market there are many different methods of BIOS flashing, and some of them are out of date or are simply done incorrectly, leading to trouble.  The cure in many cases is to simply re-flash the system BIOS.

If I were you, at this point I'd strip the board of all hardware and clear CMOS (if the initial clear doesn't help), leaving the battery out for a couple of minutes at least.  Then start with a single CPU and a single stick of RAM and test for the condition.  If the system runs normally you can add the other components and test again.  From there you could look for damaged capacitors or other board problems.  You could also swap hardware with one of the other servers you have and see if the trouble follows any particular piece of hardware.  It's time consuming, but in your case it may be the only way to find out for sure exactly what the trouble is.  One more thing: is the RAM matched in this system?  Have you run any passes of Memtest?
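The add-back procedure above could be sketched roughly as follows.  This is only an illustration of the logic, not a Dell tool: `find_suspect`, `boots_ok`, and the component names are hypothetical, and `boots_ok` stands in for a person doing a cold boot and noting whether the machine got past POST.  Because the fault is intermittent, each step is repeated several times before the next part goes in.

```python
from typing import Callable, List, Optional


def find_suspect(
    baseline: List[str],
    extras: List[str],
    boots_ok: Callable[[List[str]], bool],
    trials: int = 5,
) -> Optional[str]:
    """Add parts back one at a time and return the first one that breaks booting.

    baseline: the stripped-down configuration (e.g. one CPU, one DIMM).
    extras:   parts to add back, in the order they will be installed.
    boots_ok: hypothetical callback; returns True if a cold boot with the
              given configuration completed POST without "System Halted!".
    trials:   cold boots to attempt per configuration (the fault is intermittent).
    """
    config = list(baseline)

    # If the stripped-down system already fails, suspect the board, the
    # remaining CPU/DIMM, or the power supply rather than any add-on part.
    if not all(boots_ok(config) for _ in range(trials)):
        return "baseline"

    for part in extras:
        config.append(part)
        if not all(boots_ok(config) for _ in range(trials)):
            return part

    # Every configuration booted on every trial; nothing was isolated.
    return None


if __name__ == "__main__":
    # Simulated example only: pretend the second CPU triggers the fault.
    simulated = lambda cfg: "CPU 2" not in cfg
    print(find_suspect(["CPU 1", "DIMM 1"], ["DIMM 2", "CPU 2", "RAID card"], simulated))
```

A clean result from a handful of trials does not prove a part is good when the failure is this sporadic, so a higher `trials` value (and long power-off periods between boots) would give more confidence.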

January 9th, 2010 05:00

Yes, the RAM is identical in every way - same size, manufacturer.  All of the recommended MpMemory tests pass - so there is no evidence that a bad memory module is the source of the problem.

I'll follow your recommendations and see if I can get some consistency out of it.

Thank you for the assistance.

January 11th, 2010 10:00

This problem occurred again and I was able to get additional screenshots from the RAC console redirect.

The following image comes from the same machine - but this time you will notice that during the boot it reports incorrect processor and bus speeds of 2300 MHz and 400 MHz, respectively.  The correct speeds are 3066 MHz CPU and 533 MHz FSB.

And since there is a 100% correlation between this and the system halting I was able to grab a screenshot of the "System Halted!" message as well:

Please look for anything suspicious in here.  Perhaps some of the version numbers are known to be problematic.

I remember one thing that I did a few times that has fixed this every time.  Only power supply 1 is feeding this machine.  When this error occurs I shut down the machine and plug the cord into power supply 2 - so the only electrical source is power supply 2.  When I start the machine the processor and bus speeds are reported correctly and there are no errors.  I've done this at least 3 times.

The natural conclusion almost anyone would make is that power supply 1 is bad.  So if we follow the steps in the above paragraph in reverse then we should get the error again.  However, this does not hold.  If you power down the machine, switch the feed back to power supply 1, and then fire the machine back up there is no "System Halted!" error.  It makes this problem very puzzling.

The only conclusion we can make is that everything works fine if something is plugged into power supply 2.  I really don't get it.

Another observation is that this problem occurs more often after the machine has been powered down for a long time - 48 hours or more.
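To get some consistency out of these observations, each cold boot could be written down and summarized with something like the sketch below.  The field names and the sample entries are made up for illustration (they are not real measurements from this server); only the expected 3066/533 speeds and the 48 hour threshold come from what is described in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List

EXPECTED_CPU_MHZ = 3066   # correct CPU speed for this server
EXPECTED_FSB_MHZ = 533    # correct bus speed for this server
LONG_OFF_HOURS = 48       # "powered down for a long time"


@dataclass
class BootAttempt:
    psu: int          # which power supply had the cord plugged in (1 or 2)
    hours_off: float  # how long the machine was powered down beforehand
    cpu_mhz: int      # CPU speed reported at POST
    fsb_mhz: int      # bus speed reported at POST
    halted: bool      # True if "System Halted!" appeared


def speeds_correct(a: BootAttempt) -> bool:
    return a.cpu_mhz == EXPECTED_CPU_MHZ and a.fsb_mhz == EXPECTED_FSB_MHZ


def summarize(attempts: List[BootAttempt]) -> Dict[str, object]:
    halts = [a for a in attempts if a.halted]
    # An exception is any boot where "speeds were correct" and "halted" agree,
    # i.e. correct speeds with a halt, or wrong speeds without one.
    exceptions = [a for a in attempts if speeds_correct(a) == a.halted]
    long_off = [a for a in attempts if a.hours_off >= LONG_OFF_HOURS]
    return {
        "attempts": len(attempts),
        "halts": len(halts),
        "misreport_predicts_halt": not exceptions,
        "halts_by_psu": {p: sum(1 for a in halts if a.psu == p) for p in (1, 2)},
        "attempts_after_long_off": len(long_off),
        "halts_after_long_off": sum(1 for a in long_off if a.halted),
    }


# Hypothetical entries, only to show the shape of the log.
sample_log = [
    BootAttempt(psu=1, hours_off=72, cpu_mhz=2400, fsb_mhz=400, halted=True),
    BootAttempt(psu=2, hours_off=0.1, cpu_mhz=3066, fsb_mhz=533, halted=False),
    BootAttempt(psu=1, hours_off=0.1, cpu_mhz=3066, fsb_mhz=533, halted=False),
]

if __name__ == "__main__":
    for key, value in summarize(sample_log).items():
        print(f"{key}: {value}")
```

With enough entries, a log like this would show whether the failures really cluster on one power supply or only on long power-off periods, which is hard to judge from memory.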

30 Posts

January 11th, 2010 13:00

Interesting.  Can you clarify what you mean by 'Power Supply'?  Do you mean a power conditioner/battery back-up device?

January 11th, 2010 14:00

I mean both of the onboard power supplies where you take the electrical cord and plug it directly into the machine.

This machine is set up just like a normal desktop with the A/C going directly into the machine from the wall.  There is no power conditioner or UPS or other device involved.

I now realize that the term "power supply" is a misnomer - but that's just the term people use for that component.  I don't know what else to call it - besides a really big, loud, heavy, onboard A/C adapter.

30 Posts

January 11th, 2010 15:00

I see, then this sounds like a cold boot problem and in this case is power supply related.  Is the configuration of the other four 2650's identical?  Try switching this redundant PSU into one of the other servers and test.

1 Message

April 10th, 2013 01:00

I had this exact System Halted problem on my dual-CPU 2650, and I couldn't get it to boot at all.  I finally noticed one of the CPU fan clips was not clipped correctly and the CPU fan was not touching the CPU on one side; it was tilted/raised slightly from the pressure of the clip on the other side.  I reseated the clip and it booted up first try, with no problems yet.  So anyway, this appears to be an error related to the CPU.  Check the CPU fan...
