PE2950 Crashing

Question

Hello, I have a PE2950 that keeps crashing. From the DRAC card I find these in the logs every time it does it. Is this likely just some bad memory or something more sinister at work?

08/20/2011 14:25:42 CPU 1 machine check detected.
08/20/2011 14:40:08 CPU 2 is operating correctly.
08/20/2011 14:40:08 CPU 1 is operating correctly.
08/20/2011 14:39:53 CPU 2 has an internal error (IERR).
08/20/2011 14:39:53 CPU 1 has an internal error (IERR).
08/20/2011 14:39:49 Multi-bit memory errors detected on a memory device at location
08/20/2011 14:39:49 Multi-bit memory errors detected on a memory device at location

Thanks.
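The entries above are not listed in time order, which makes the sequence of events hard to follow. As a minimal sketch (the function names are made up for illustration, and the entries are copied as posted, including the cut-off "at location" text), they can be sorted by timestamp like this:

```python
from datetime import datetime

# Entries copied from the DRAC log quoted above, in the order posted.
RAW_LOG = [
    "08/20/2011 14:25:42 CPU 1 machine check detected.",
    "08/20/2011 14:40:08 CPU 2 is operating correctly.",
    "08/20/2011 14:40:08 CPU 1 is operating correctly.",
    "08/20/2011 14:39:53 CPU 2 has an internal error (IERR).",
    "08/20/2011 14:39:53 CPU 1 has an internal error (IERR).",
    "08/20/2011 14:39:49 Multi-bit memory errors detected on a memory device at location",
    "08/20/2011 14:39:49 Multi-bit memory errors detected on a memory device at location",
]


def parse_entry(line):
    """Split one log line into (timestamp, message)."""
    date_part, time_part, message = line.split(" ", 2)
    stamp = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S")
    return stamp, message


def sort_entries(lines):
    """Return (timestamp, message) pairs, oldest first."""
    return sorted((parse_entry(line) for line in lines), key=lambda e: e[0])


for stamp, message in sort_entries(RAW_LOG):
    print(stamp.strftime("%m/%d/%Y %H:%M:%S"), message)
```

Read in this order, the multi-bit memory errors at 14:39:49 come a few seconds before the IERR on both CPUs at 14:39:53.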

DELL-Steve R · Answer

Hi Justin, the first thing I would try is to re-seat the processors and memory and make sure the BIOS and all firmware are up to date. If you are still getting these errors, the next step would be to run the diagnostics to isolate a part and, if possible, swap it with a known good part. Let us know how you get on.

Steve

theflash1932 · Answer

I agree with Steve ... these errors can be caused by bad CPUs, board, or even memory, but they can also come from errors in how the firmware handles problems, and may be resolved by firmware updates that fix the underlying causes. You may also want to look at any expansion cards you have in the system, especially if they are non-certified cards.

JustinY · Answer

I re-installed all the firmware and then ran the Dell diags, and everything checked OK. I then ran the Dell memory test and it all tested OK. I then ran MemTest86+ for 83+ hours with no problems.

In the documentation it says a multi-bit memory error could be on the RAID card.

support.dell.com/.../document.aspx

That might make more sense, because all the other diags tested OK. The only disk tests it seems to run are on the physical media; it does not seem to do a very thorough test of the RAID card.

JM99 · Answer

I am having the same issue. Our server is in a "not as cool" environment, so I am wondering if I may also be seeing some borderline temperature issues. Error shown in the log:

"Multi-bit memory errors detected on a memory device at location Memory Board A"

First off, where is a good diagram that says "Here is [Board A]" (for the 2950 servers)?

Secondly, if it is the 'board,' then that likely means 'Motherboard,' if I understand correctly; right?

Thirdly, if it happens to be a RAID issue, is it the Internal RAID (Perc) or the external (SAS)?

I have:

2950 Server, with Windows Server 2008 R2 Enterprise, SP1

Server 2008 R2; Exchange 2010 SP2.

I have external RAID, PERC 5/E Adapter (all latest firmware/drivers) connected to MD-1000 array

Internal RAID, PERC 5/i Integrated - all latest firmware/drivers

All firmware, BIOS, etc. is as up-to-date as possible.

This only started about a week ago, since some storms - so, wondering also if something might have taken a 'jolt.'

In my case, this is a twin DAG Exchange server, across fiber at another location.

OMSA log shows: Multi-bit memory errors detected on a memory device at location Memory Board A.

Will re-seat everything and will run diags, but I'm guessing they will turn out fine.

Any help and insights appreciated.

We do NOT have support - so any call would be on an expensive "per-call" basis.

theflash1932 · Answer

"We do NOT have support - so any call would be on an expensive 'per-call' basis"

Not if you are in the States or Canada; support is always free, regardless of warranty status.

In addition to reseating and diags (a very prudent set of first steps), you said your BIOS was up to date, but is your ESM firmware up to date as well? It is often overlooked when hardware is updated, and it is the device responsible for monitoring and reporting hardware errors.

Technically, there is nothing called "Memory Board A" ... if that is the exact wording you are seeing (does it not give a bank or DIMM?), then I might assume it is lettering the slots instead of numbering them (although you could be right in that it might simply be referencing the motherboard). The slots are numbered from the slot closest to the processor out to the edge as follows: 1,5,2,6,3,7,4,8 (alternating slots 1-8).

The message (unless it states RAID/ROMB) will not have anything to do with the RAID controller(s).
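For anyone trying to find a given DIMM on the board, the slot ordering described above can be written out as a small lookup. This is only a sketch based on the order given in this post (1,5,2,6,3,7,4,8, starting at the slot nearest the processor); the function names are made up for illustration, and the ordering should be checked against the labels on your own board:

```python
# Slot numbers in physical order, as described in the post above.
# Index 0 is the slot closest to the processor; the last entry is at the edge.
SLOT_ORDER = [1, 5, 2, 6, 3, 7, 4, 8]


def slot_at_position(position):
    """Return the DIMM slot number at a physical position (0 = nearest the CPU)."""
    return SLOT_ORDER[position]


def position_of_slot(slot):
    """Return the physical position (0 = nearest the CPU) of a DIMM slot number."""
    return SLOT_ORDER.index(slot)


for position, slot in enumerate(SLOT_ORDER):
    print(f"position {position} from the processor: DIMM slot {slot}")
```

So, going by that ordering, slot 5 would be the second slot out from the processor and slot 8 would be the one at the edge.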
