December 30th, 2004 19:00

PE1750 consistent system lock ups

I'm hoping that someone here will be able to help me with my troubles.  Earlier this month, I bought a PowerEdge 1750 and have it in a colocated environment.  Ever since I dropped it off at the data center, I've been having a whole mess of system lock ups.
 
First, a little about the system.  It is a basic PE1750 with dual 2.4 GHz Xeons, redundant PSU, and ERA/O.  The system came with 1 GB of RAM and one 36 GB 10k rpm drive.  I swapped out the RAM and hard drive for 2 GB of SimpleTech DDR266 registered ECC RAM that I bought new and 3x 36 GB 15k rpm drives.  The hard drives turned out to be refurbished (which I didn't know when I ordered them; it wasn't stated anywhere).  I complained to the company, but ended up keeping them.  I would have bought the components from Dell, but face it, Dell isn't the cheapest for memory and HD upgrades.
 
The system is running Windows Small Business Server 2003 Premium edition with Exchange, IIS, and SQL Server running.  Aside from those, there isn't really anything else running (besides OpenManage and the Dell utils).  I've run this same software config before on other servers, so I am confident the software/OS is kosher.
 
When the system locks up, it basically just completely stops responding.  I can't get direct console access since it is at the data center, and the remote management card won't show anything from the console or an error screen.  All I can really do is have it reset the system (which sometimes takes a couple of requests).  There is almost nothing in the event log... it is like it just decides it's going to stop everything.
 
I have managed to locate a couple of errors that might help.  First, in the ERA/O, it shows a couple of errors in the hardware log as follows, all at the same time:
- System software event - CPU Bus Parity Error detected.
- System software event - CPU Internal Error detected.
- System software event - CPU Internal Error detected.
 
I've also got a couple of errors in the system event log that occur sometimes just before it stops responding:
 
(shows this one twice)
Event Type: Error
Event Source: WMIxWDM
Event Category: None
Event ID: 107
Date:  12/30/2004
Time:  8:43:06 AM
User:  N/A
Computer: CHEF
Description:
Machine Check Event reported is a fatal error.
 
Event Type: Error
Event Source: symmpi
Event Category: None
Event ID: 15
Date:  12/30/2004
Time:  8:41:06 AM
User:  N/A
Computer: CHEF
Description:
The device, \Device\Scsi\symmpi1, is not ready for access yet.
 
Event Type: Error
Event Source: Disk
Event Category: None
Event ID: 11
Date:  12/30/2004
Time:  8:41:06 AM
User:  N/A
Computer: CHEF
Description:
The driver detected a controller error on \Device\Harddisk0.
 
It only has extended info on the last disk error.  It says "This problem is typically caused by a failing cable that connects the drive to the computer" and says to replace the cable.  There is no cable in this system though, so it isn't that easy.  Maybe the SCSI backplane?
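
For anyone trying to line up entries like these, here is a minimal Python sketch that parses the three events as pasted above and sorts them by timestamp. The `parse_events` helper and the `RAW_LOG` text layout are assumptions made for illustration (they are not part of any Dell or Windows tool), and the ordering only shows what was logged first, not what caused what.

```python
import re
from datetime import datetime

# Event text copied from the post above (Category/User/Computer lines omitted).
RAW_LOG = """\
Event Type: Error
Event Source: WMIxWDM
Event ID: 107
Date:  12/30/2004
Time:  8:43:06 AM
Description:
Machine Check Event reported is a fatal error.

Event Type: Error
Event Source: symmpi
Event ID: 15
Date:  12/30/2004
Time:  8:41:06 AM
Description:
The device, \\Device\\Scsi\\symmpi1, is not ready for access yet.

Event Type: Error
Event Source: Disk
Event ID: 11
Date:  12/30/2004
Time:  8:41:06 AM
Description:
The driver detected a controller error on \\Device\\Harddisk0.
"""


def parse_events(text):
    """Split pasted event-log text into one dict per event (hypothetical helper)."""
    events = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        lines = block.splitlines()
        for i, line in enumerate(lines):
            # Split on the first colon only, so "8:43:06 AM" stays intact.
            key, sep, value = line.partition(":")
            if not sep:
                continue
            key = key.strip()
            if key == "Description":
                # The description text sits on the line(s) after the label.
                fields[key] = " ".join(l.strip() for l in lines[i + 1:])
                break
            fields[key] = value.strip()
        fields["When"] = datetime.strptime(
            fields["Date"] + " " + fields["Time"], "%m/%d/%Y %I:%M:%S %p"
        )
        events.append(fields)
    return events


# Sort oldest first and print each event's offset from the earliest one.
events = sorted(parse_events(RAW_LOG), key=lambda e: e["When"])
first = events[0]["When"]
for e in events:
    offset = (e["When"] - first).total_seconds()
    print(f"+{offset:4.0f}s  {e['Event Source']:8} ID {e['Event ID']:>3}  {e['Description']}")
```

Sorted this way, the symmpi and Disk errors (8:41:06) come two minutes before the fatal machine check (8:43:06).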
 
I have run all of the diagnostic tests in OpenManage and they all have passed.  I had the system for about a week before taking it to the data center and didn't have any lock ups, but I hadn't had it running for more than a couple of hours to install/configure the system and run all the diagnostics before dropping it off.
 
From this info, I'm guessing the culprit is one of the following:
 - Bad CPU
 - Bad onboard SCSI controller or backplane
 - Bad hard drive
 
Any ideas would be GREATLY appreciated.  I'd really like to get this fixed, since I'm not yet comfortable switching entirely over to using it until it is stable.
 
Thanks,
  Ken Robertson

January 1st, 2005 06:00

With this sort of problem I'd be looking at either a RAID controller failure or SCSI backplane.
What sort of RAID controller are you using?
If it is onboard then it will mean a possible system board swap.

January 3rd, 2005 03:00

It doesn't have a RAID controller, it just has the onboard SCSI controller.

To see if it was a hacker or anything else, I completely closed the firewall and blocked all traffic to it, and it still locked up.  Now I cannot even restart the system.  It locked up sometime Friday night, and it won't respond to any of the reset/power cycle/power off commands.

October 3rd, 2006 14:00

This thread is well over a year and a half old. The issue was resolved a very long time ago. I appreciate the help, but I'm sure there are people with recent issues who can use it more.

As for what the problem was, it was a fried CPU. The server came from Dell with hardly any thermal grease on one of the CPUs, causing it to consistently overheat and lock up, and eventually it fried itself. Dell came out and replaced the CPU, and it has been working fine ever since.

Ken

October 3rd, 2006 14:00

Machine check errors are almost always caused by a processor being faulty or improperly seated. Try reseating the processors and heatsinks first. If the issue persists (and the system has 2 processors), try running on only 1 processor in the primary processor socket. If the issue still persists, swap in the other processor, alternating until the faulty processor is identified.
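
The one-processor-at-a-time procedure above can be sketched in Python as follows. This is only an illustration of the logic: `isolate_faulty_cpu`, `passes_alone`, and the `observed` results are hypothetical names and made-up data, and in practice each "test" is a manual step (install one CPU in the primary socket, run the server, and watch for lock ups).

```python
def isolate_faulty_cpu(cpus, passes_alone):
    """Return the CPUs that fail when run alone in the primary socket.

    cpus:         labels for the installed processors (hypothetical).
    passes_alone: callable returning True if the system stays stable
                  with only that CPU installed in the primary socket.
    """
    faulty = []
    for cpu in cpus:
        # One manual test per CPU: only this CPU, primary socket.
        if not passes_alone(cpu):
            faulty.append(cpu)
    return faulty


# Made-up results for illustration: CPU2 locks up when run alone.
observed = {"CPU1": True, "CPU2": False}
suspects = isolate_faulty_cpu(["CPU1", "CPU2"], lambda cpu: observed[cpu])
print(suspects)
```

If the returned list is empty, neither processor failed on its own, and this test alone is inconclusive.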

October 3rd, 2006 14:00

Ken,
Make sure you have the latest firmware for the ERA. Once you have updated the ERA firmware to version 3.35, make sure you download and run the Dell Server Update Utility on the server. This utility will update the remainder of your drivers/firmware/BIOS if needed. Once everything is updated, update your OpenManage software to the latest version.
 
As a precaution make sure you have a known good backup of your data.
 
ERA update for Dell 1750 (this update almost always resolves the types of errors you are getting)
 
Dell Server Update CD (works on all Dell servers running Windows; burn it to a CD, boot into Windows, insert the CD, and follow the onscreen instructions)
 
Newest version of Dell OpenManage software (make sure you uninstall all previous versions of OpenManage)
 

October 3rd, 2006 15:00

I thought it might be resolved. However, other people still read these old forum messages, and my hope is that this information will help them too!
 
Let me know if I can further assist.
 
-Dustin
