M620 Blade Memory issues
Hi,
We are seeing the following on a blade, but the Dell logs show no errors. Any suggestions on what to check?
Thx
John Bradshaw
(It's running Red Hat release 6.2 (Santiago), kernel 2.6.32-220.13.1.el6.x86_64.)
Dec 4 14:28:17 MyMachine kern.warning<4> kernel:sbridge: HANDLING MCE MEMORY ERROR
Dec 4 14:28:17 MyMachine kern.warning<4> kernel:CPU 1: Machine Check Exception: 0 Bank 11: 8c000050000800c3
Dec 4 14:28:17 MyMachine kern.warning<4> kernel:TSC 0 ADDR 795ee4000 MISC 90010001000208c PROCESSOR 0:206d7 TIME 1386127697 SOCKET 1 APIC 20
Dec 4 14:28:17 MyMachine kern.warning<4> kernel:sbridge: HANDLING MCE MEMORY ERROR
Dec 4 14:28:17 MyMachine kern.warning<4> kernel:CPU 1: Machine Check Exception: 0 Bank 11: 8c000050000800c3
Dec 4 14:28:17 MyMachine kern.warning<4> kernel:TSC 0 ADDR 795ee4000 MISC 90010001000208c PROCESSOR 0:206d7 TIME 1386127697 SOCKET 1 APIC 20
Dec 4 14:28:18 MyMachine kern.warning<4> kernel:EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#3_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=1 Err=0008:00c3 (ch=3), addr = 0x795ee4000 => socket=1, Channel=3(mask=8), rank=0
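For anyone trying to read the raw values above, here is a minimal Python sketch that splits the Bank 11 status word into fields. It assumes the register follows the IA32_MCi_STATUS layout and the memory-controller error-code format described in the Intel SDM; the function name and dictionary keys are made up for illustration, and this is not how EDAC or the BMC decode the value internally.

```python
# Sketch only: assumes the Intel SDM layout of IA32_MCi_STATUS.
# decode_mci_status and its key names are hypothetical, chosen for this example.
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}

def decode_mci_status(status):
    """Split a 64-bit machine check status value into named fields."""
    mcacod = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,             # model-specific code
        "mcacod": mcacod,                             # architectural code
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC.
    if mcacod >> 7 == 1:
        fields["memory_op"] = MEMORY_OPS.get((mcacod >> 4) & 0x7, "reserved")
        fields["channel"] = mcacod & 0xF
    return fields

# Status word taken from the "Bank 11" log line above.
for name, value in decode_mci_status(0x8C000050000800C3).items():
    print(name, value)
```

Under those assumptions the value decodes as a valid, corrected (UC bit clear) memory scrubbing error on channel 3 with a corrected-error count of 1, which lines up with the "CE" and "ch=3" in the EDAC line despite the "FATAL area" wording.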
DELL-Daniel My
Moderator
6.2K Posts
December 5th, 2013 16:00
Yes, that is correct. With EDAC enabled you are not taking advantage of the baseboard management controller (BMC). EDAC is intended for systems where hardware-level management is not available. When EDAC is enabled, the BMC does not log the errors. EDAC can also report spurious errors, and it does not always provide enough information to troubleshoot the problem properly.
It would be much better to disable EDAC and let the BMC handle hardware error reporting and logging. If you are running RHEL on a desktop system without a BMC/ESM, EDAC is a nice tool for hardware-level error reporting, but on a server it is a limiting factor.
Thanks
DELL-Chris H
Moderator
8.4K Posts
December 4th, 2013 14:00
Bradje1,
These errors occur when the EDAC (Error Detection and Correction) module reads the error registers from the chipset. The registers are read-once, so when EDAC is enabled it gets to them first. What you will need to do is blacklist the EDAC driver. First, list the EDAC modules that are loaded:
# lsmod | grep -i edac
Take the module names from that output and blacklist them by editing /etc/modprobe.d/blacklist.conf and adding one line per module to the bottom of the file, for example:
blacklist i7core_edac
blacklist edac_core
After that, reboot and run diagnostics to confirm whether the issue is resolved.
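The module names differ between platforms, so it is safer to build the blacklist lines from the actual lsmod output than to copy the two names above. The "sbridge:" prefix in the posted log suggests the Sandy Bridge driver is in use, which would probably show up as sb_edac rather than i7core_edac, but that is a guess until lsmod confirms it. A rough Python sketch of the idea follows; the function name is hypothetical and the sample lsmod text (including the sizes) is invented for illustration.

```python
# Sketch only: turns lsmod text into blacklist lines for any module
# whose name contains "edac". edac_blacklist_lines is a hypothetical helper.
def edac_blacklist_lines(lsmod_output):
    """Return 'blacklist <module>' lines for EDAC modules found in lsmod text."""
    names = []
    for row in lsmod_output.splitlines():
        cols = row.split()
        # First column of lsmod output is the module name.
        if cols and "edac" in cols[0].lower() and cols[0] not in names:
            names.append(cols[0])
    return ["blacklist " + name for name in names]

# Invented sample; on a real system paste the output of `lsmod` here.
sample = """Module                  Size  Used by
sb_edac                18104  0
edac_core              53053  1 sb_edac
ext4                  371331  2
"""
print("\n".join(edac_blacklist_lines(sample)))
```

The printed lines are what would be appended to /etc/modprobe.d/blacklist.conf before rebooting.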
Let me know if this helps.
bradje1
743 Posts
December 4th, 2013 15:00
Hey Chris,
Thx for the reply.
I might not be reading this correctly, but we don't really want to stop the O/S from detecting memory errors.
Is that what the above will do?
Cheers,
John Bradshaw
bradje1
743 Posts
December 5th, 2013 11:00
Bump
bradje1
743 Posts
December 8th, 2013 11:00
Thx for the explanation Daniel. That is very helpful to know.
John Bradshaw