December 4th, 2013 13:00

M620 Blade Memory issues

Hi,

We are seeing the following on a blade, but the Dell logs show no errors. Any suggestions on what to check?

Thx

John Bradshaw

(It's running Red Hat release 6.2 (Santiago), kernel 2.6.32-220.13.1.el6.x86_64 (x86_).)

Dec  4 14:28:17 MyMachine kern.warning<4> kernel:sbridge: HANDLING MCE MEMORY ERROR

Dec  4 14:28:17 MyMachine kern.warning<4> kernel:CPU 1: Machine Check Exception: 0 Bank 11: 8c000050000800c3

Dec  4 14:28:17 MyMachine kern.warning<4> kernel:TSC 0 ADDR 795ee4000 MISC 90010001000208c PROCESSOR 0:206d7 TIME 1386127697 SOCKET 1 APIC 20

Dec  4 14:28:17 MyMachine kern.warning<4> kernel:sbridge: HANDLING MCE MEMORY ERROR

Dec  4 14:28:17 MyMachine kern.warning<4> kernel:CPU 1: Machine Check Exception: 0 Bank 11: 8c000050000800c3

Dec  4 14:28:17 MyMachine kern.warning<4> kernel:TSC 0 ADDR 795ee4000 MISC 90010001000208c PROCESSOR 0:206d7 TIME 1386127697 SOCKET 1 APIC 20

Dec  4 14:28:18 MyMachine kern.warning<4> kernel:EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#3_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=1 Err=0008:00c3 (ch=3), addr = 0x795ee4000 => socket=1, Channel=3(mask=8), rank=0
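
For readers following the log above, the Bank 11 status word can be unpacked by hand. The sketch below is a minimal illustration, assuming the standard Intel machine-check status layout (bit 63 valid, bit 61 uncorrected, bit 58 address valid, bits 52:38 corrected error count, bits 31:16 model-specific code, bits 15:0 MCA error code). The function name decode_mce_status is made up for this example and is not part of any Dell or Red Hat tool.

```shell
#!/bin/sh
# Decode a few fields of a machine-check status value.
# decode_mce_status is a hypothetical helper name, not a standard tool.
# Argument: the status as 16 hex digits, without a 0x prefix.
decode_mce_status() {
    status=$1
    hi=${status%????????}     # first 8 hex digits = bits 63:32
    lo=${status#"$hi"}        # last 8 hex digits  = bits 31:0
    hi=$((0x$hi))
    lo=$((0x$lo))

    echo "valid=$(( (hi >> 31) & 1 ))"                    # bit 63
    echo "uncorrected=$(( (hi >> 29) & 1 ))"              # bit 61
    echo "addr_valid=$(( (hi >> 26) & 1 ))"               # bit 58
    echo "corrected_count=$(( (hi >> 6) & 0x7fff ))"      # bits 52:38
    printf 'model_code=%04x\n' $(( (lo >> 16) & 0xffff )) # bits 31:16
    printf 'mca_code=%04x\n' $(( lo & 0xffff ))           # bits 15:0
}

# Status value taken from the log above.
decode_mce_status 8c000050000800c3
```

For the value in the log this reports uncorrected=0 and corrected_count=1, with model_code=0008 and mca_code=00c3, which lines up with the "Err=0008:00c3" and "1 Unknown error(s)" fields in the EDAC line. The low nibble of the MCA code (3) matches the reported channel.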

Moderator

6.2K Posts

December 5th, 2013 16:00

"I might not be reading this correctly, but we don't really want to stop the O/S from detecting memory errors. Is that what the above will do?"

Yes, that is correct. With EDAC enabled you are not taking advantage of the baseboard management controller (BMC). EDAC is intended for systems where hardware-level management is not available. When EDAC is enabled, the BMC does not log the errors. EDAC can also report spurious errors and does not always provide enough information to troubleshoot the problem adequately.

It would be much better to disable EDAC and let the BMC handle hardware error reporting and logging. If you are running RHEL on a desktop system without a BMC/ESM, then EDAC is a nice tool for hardware-level error reporting, but on a server it is a limiting factor.
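
Once the BMC is doing the logging, its System Event Log can be read from the OS to confirm that memory events are arriving there. This is a minimal sketch, assuming ipmitool is installed and the IPMI kernel driver is loaded. The helper name filter_memory_events is invented for this example, and the keywords it matches are a guess at typical event wording, not a Dell-defined format.

```shell
#!/bin/sh
# filter_memory_events is a hypothetical helper: it reads an event log on
# stdin and keeps only the lines that mention memory or ECC
# (case-insensitive). It succeeds even when nothing matches.
filter_memory_events() {
    grep -i -E 'memory|ecc' || true
}

# Typical use on the server itself, as root:
#   ipmitool sel elist | filter_memory_events
```

An empty result only means no line matched those keywords, so it is worth reading the full log at least once.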

Thanks

Moderator

8.4K Posts

December 4th, 2013 14:00

Bradje1,

These errors occur when the EDAC (Error Detection and Correction) module reads the error registers from the chipset. The registers are read-once, and when EDAC is enabled it gets to them first. What you will need to do is blacklist the EDAC driver. First, list the EDAC modules that are loaded:

# lsmod | grep -i edac

Take the modules it lists and blacklist them by editing /etc/modprobe.d/blacklist.conf and adding a line for each at the bottom of the file, for example:

blacklist i7core_edac

blacklist edac_core

After that, reboot and run diagnostics to confirm whether the issue is resolved.
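
The blacklist entries can also be generated from the lsmod output instead of typed by hand, which avoids blacklisting a module name that is not actually loaded. The sketch below is only an illustration, and edac_blacklist_lines is a hypothetical helper name. On this particular blade the "sbridge:" prefix in the log suggests the loaded driver may be sb_edac rather than i7core_edac, but that is an assumption to check against the real lsmod output.

```shell
#!/bin/sh
# edac_blacklist_lines is a hypothetical helper: it reads `lsmod` output on
# stdin and prints one "blacklist <module>" line for every module whose
# name (first column) contains "edac".
edac_blacklist_lines() {
    awk 'tolower($1) ~ /edac/ { print "blacklist " $1 }'
}

# Review the generated lines first:
#   lsmod | edac_blacklist_lines
# Then, as root, append them to the blacklist file:
#   lsmod | edac_blacklist_lines >> /etc/modprobe.d/blacklist.conf
```

Only the first column is matched, so a module that merely appears in another module's "Used by" list is not picked up.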

Let me know if this helps.

743 Posts

December 4th, 2013 15:00

Hey Chris,

Thx for the reply.

I might not be reading this correctly, but we don't really want to stop the O/S from detecting memory errors.

Is that what the above will do?

Cheers,

John Bradshaw

743 Posts

December 5th, 2013 11:00

Bump

743 Posts

December 8th, 2013 11:00

Thx for the explanation Daniel. That is very helpful to know.

John Bradshaw
