1351659

July 6th, 2008 04:00

PowerEdge 2900 E1410 CPU IERR was asserted

I got an error message on the front panel of my PE2900: E1410 CPU1 IERR was asserted (CPU1 Status: Processor sensor for CPU1, IERR was asserted). Restarting the computer fixed the problem yesterday. However, it happened again today and the machine would not boot any more. Could anyone help me?

 

PowerEdge 2900

BIOS: 2.2.6

DRAC 5 A00 1.32 (07.12.22)

145 Posts

July 6th, 2008 20:00

You have a hardware failure - most likely CPU 1, though the fault could be elsewhere (motherboard, perhaps). The Hardware Owners Manual says that this error occurs when the specified CPU (CPU 1 in your case) reports an internal error.

 

 

If you have a hardware warranty, call Dell in the morning.

 

 

If you don't have a hardware warranty, then you are on your own.

 

If this is a single CPU machine, you could try to obtain a replacement CPU, which will be much easier with it being a single CPU machine, as you won't be trying to match the specification of an existing CPU. However, you must keep to the processor compatibility of your machine - the 5400 series processors only work in 2900 III machines, and 5300 series processors only work in 2900 II and III machines. I believe support for the older processors was dropped in the newer machines.

 

If this is a dual CPU machine, you could remove and store CPU 1 and move CPU 2 to the CPU 1 socket to see if the machine will now boot.

8 Posts

July 6th, 2008 23:00

Hi David,

 

Thanks for your reply. I bought this machine in Feb 2008, so I believe it is under maintenance. I'm not sure whether it's a hardware failure. The error message appeared when the room temperature was above 30C, but the machine works fine at 28C. Is that normal?

 

Jaguar 

1.2K Posts

July 7th, 2008 22:00

No, it is not a normal condition. It is a legitimate error.

8 Posts

July 7th, 2008 23:00

I also believe there is something wrong with the PE2900. However, when I called the Dell service guy, he told me that it's only because of the higher temperature and that everything is OK. Are there any other tools I could use to check the PE2900 for potential errors? Please advise.

 

Thanks for your help.

 

Jaguar 

145 Posts

July 8th, 2008 14:00

This error occurs when the CPU asserts its Internal ERRor (IERR) pin.

 

There are several possible reasons for this. There are a couple of listed processor errata that can cause it, but your failure to boot in what are not unreasonable ambient temperatures is of concern.

 

You could adopt a 'watch and wait' approach. However, if it happens again, I'd want an engineer out, at the very least to reseat the heatsink for that CPU, and ideally to swap the CPU for another. As snapohead says, it's a real and legitimate error.

8 Posts

July 11th, 2008 07:00

My PE2900 died today after I got message E122C this morning (System Board CPU Power Fault: Voltage sensor for System Board, state asserted was asserted). It cannot be started any more. A Dell engineer will replace the CPU and motherboard, but because of the weekend I need to wait until next Monday. :( My question is: why did it fail? Do you have any experience with this?

 

Regards,

 

Jaguar

 

145 Posts

July 11th, 2008 12:00

Hardware failure can easily be just one of those things. It could be that something like the voltage regulation components for the CPU failed on the motherboard; the first sign of this might be some sort of blip that caused the IERR.

 

A new motherboard and CPU should get you going - how unfortunate it failed on a Friday so that your service call isn't until Monday.

8 Posts

July 11th, 2008 13:00

Dear David,

 

Many thanks for your recent comments. I'm just a little depressed, because I thought the PE2900 would be very stable. Are there any suggestions about the environment for a PE2900? I do not want to have a hardware failure again. Or any suggestions for a redundant system? I can't do anything without this server. :(

 

Thanks again for your help.

 

Jaguar

 

145 Posts

July 12th, 2008 10:00

It's probably just "one of those things" - you could join the service engineer on Monday to see if there's any obvious component damage on the motherboard near the CPUs. I wouldn't be surprised if there's nothing to see, though.

 

There is a fair amount of redundancy in these servers (typically they have RAID disks and redundant power supplies), but it isn't possible to duplicate everything.

1.2K Posts

July 14th, 2008 04:00

You could buy another server and set them up in a cluster for increased redundancy.

145 Posts

July 14th, 2008 10:00


@snapohead wrote:
You could buy another server and set them up in a cluster for increased redundancy.

Indeed! Those single points of failure will always catch you out.

 

If you are going to cluster, the greater the separation between the two servers, the better. For example, if you can connect the second server to different network switches, you've added redundancy against a network switch going down. The same applies for power and UPSes. It's also best, if possible, to put the two servers in different locations in case there's a problem that affects a single room or single rack.

 

Of course, the amount of separation may be limited by your software - as with so many things, it's a balancing act.
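
To put rough numbers on the point about separation, here is a minimal sketch of the arithmetic. The availability figures are hypothetical, picked only for illustration, and the calculation assumes that component failures are independent of each other, which real failures (a room fire, a power cut) often are not.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are needed, so their availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component is enough: the group is down only if all are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical availabilities, chosen only to illustrate the arithmetic.
SERVER = 0.99
SWITCH = 0.995
UPS = 0.995

# One server behind one switch and one UPS.
one_path = series(SERVER, SWITCH, UPS)

# Two clustered servers that still share one switch and one UPS.
shared = series(parallel(SERVER, SERVER), SWITCH, UPS)

# Two clustered servers, each with its own switch and UPS.
separated = parallel(one_path, one_path)

print(f"single server:                    {one_path:.5f}")
print(f"cluster, shared switch and UPS:   {shared:.5f}")
print(f"cluster, separate switch and UPS: {separated:.5f}")
```

With these assumed figures the shared switch and UPS cap what the cluster can achieve, while giving each server its own path removes that cap.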

 

 

Hopefully you're back up and running after the service engineer's call today.

8 Posts

July 14th, 2008 11:00

It works fine after replacing the motherboard, CPU and memory. One of the memory modules had also failed and could not be detected after the new motherboard was installed. Because they had also brought a PERC/i and DRAC, I asked them to replace everything to make sure all is OK. Dell's service is good; it was only because of the weekend that I could not get support until Monday. I also appreciate your comments on my problem. :)

 

 

Jaguar 

145 Posts

July 14th, 2008 16:00

Great - it's good to hear that all went well with the engineer call. It sounds like something fairly significant went wrong to need a new CPU, motherboard and at least one stick of memory. I agree that it was wise to replace as much as possible.

 

By the way - though it's really an academic question - does your 2900 have one CPU or two?

 

 

Hopefully that's the end of the problems. If not, I'd be pushing to get the power supply (or power supplies) replaced. If there are any further problems, I'd also check out your power to the machine.

 

However, I suspect it was one of those things, and all will be well now.

8 Posts

July 14th, 2008 22:00

I have only one CPU, 4G RAM (2x2G) and 500G RAID 0 (2 SATA HDD). I'm interested in why it is an academic question.

 

BTW, I will continue to monitor it; if there is still a problem, I will follow your suggestion and get the power supply replaced. Of course I really hope this is the end of the problem. Thanks again for your help.

 

 

Jaguar 

145 Posts

July 15th, 2008 17:00

One tip for stability's sake - if you really meant RAID 0 and it wasn't a typo, I'd aim to discontinue the use of RAID 0 as soon as possible. RAID 0 doesn't give you any redundancy and isn't, in my opinion, suitable for server use outside specialist applications that have a reason for needing it (such as some video editing setups). RAID 0 is less robust than a single hard disk - if either disk fails, all the array data is lost.

 

A pair of 500GB hard disks are relatively inexpensive - that will allow you to have a 500GB RAID 1. If either drive fails, the array will keep on working with a single drive until you replace the failed drive to restore redundancy. That said, RAID 1 is not a substitute for proper backups.
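
As a rough illustration of why RAID 0 is less robust than a single disk while RAID 1 is more robust, here is a minimal sketch of the arithmetic. The per-disk failure probability is a hypothetical figure, not a measured rate, and the calculation assumes that disks fail independently and that a failed disk is not replaced during the period.

```python
def raid0_loss_probability(p_disk: float, n_disks: int = 2) -> float:
    """RAID 0 stripe: the data is lost if any one disk fails."""
    return 1.0 - (1.0 - p_disk) ** n_disks


def raid1_loss_probability(p_disk: float, n_disks: int = 2) -> float:
    """RAID 1 mirror: the data is lost only if every disk fails.

    Simplification: independent failures and no replacement of a failed
    disk during the period, so treat the result as a rough figure.
    """
    return p_disk ** n_disks


# Hypothetical figure for illustration only, not a measured failure rate.
P_DISK = 0.03  # assumed chance that one disk fails within a year

single = P_DISK
raid0 = raid0_loss_probability(P_DISK)
raid1 = raid1_loss_probability(P_DISK)

print(f"single disk:      {single:.4f}")
print(f"RAID 0 (2 disks): {raid0:.4f}")
print(f"RAID 1 (2 disks): {raid1:.4f}")
```

Under these assumptions the two-disk RAID 0 is roughly twice as likely to lose its data as a single disk, and the mirror is far less likely to. None of this changes the point about backups.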
