December 13th, 2017 11:00
PowerEdge R910 MCE Log Memory Errors
I've been trying to figure out which DIMM is currently throwing the error. CPU1 BANK 8 doesn't correlate to any known location.
The MCE log from /var/log/messages:
Dec 13 00:23:49 compute03 kernel: mce: [Hardware Error]: Machine check events logged
Dec 13 00:23:49 compute03 mcelog: Hardware event. This is not a software error.
Dec 13 00:23:49 compute03 mcelog: MCE 0
Dec 13 00:23:49 compute03 mcelog: CPU 1 BANK 8
Dec 13 00:23:49 compute03 mcelog: TIME 1513142629 Wed Dec 13 00:23:49 2017
Dec 13 00:23:49 compute03 mcelog: MCG status:
Dec 13 00:23:49 compute03 mcelog: MCi status:
Dec 13 00:23:49 compute03 mcelog: Uncorrected error
Dec 13 00:23:49 compute03 mcelog: Error enabled
Dec 13 00:23:49 compute03 mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Dec 13 00:23:49 compute03 mcelog: Transaction: Memory read error
Dec 13 00:23:49 compute03 mcelog: STATUS b00000000800009f MCGSTATUS 0
Dec 13 00:23:49 compute03 mcelog: MCGCAP 1000c18 APICID 40 SOCKETID 1
Dec 13 00:23:49 compute03 mcelog: CPUID Vendor Intel Family 6 Model 47
Dec 13 00:23:49 compute03 mcelog: Hardware event. This is not a software error.
Dec 13 00:23:49 compute03 mcelog: MCE 0
Dec 13 00:23:49 compute03 mcelog: CPU 0 BANK 1 TSC e02613552c8d8
Dec 13 00:23:49 compute03 mcelog: ADDR 28d2fa6300
Dec 13 00:23:49 compute03 mcelog: TIME 1513142629 Wed Dec 13 00:23:49 2017
Dec 13 00:23:49 compute03 mcelog: MCG status:
Dec 13 00:23:49 compute03 mcelog: MCi status:
Dec 13 00:23:49 compute03 mcelog: Corrected error
Dec 13 00:23:49 compute03 mcelog: Error enabled
Dec 13 00:23:49 compute03 mcelog: MCi_ADDR register valid
Dec 13 00:23:49 compute03 mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Dec 13 00:23:49 compute03 mcelog: Transaction: Memory read error
Dec 13 00:23:49 compute03 mcelog: STATUS 940000000000009f MCGSTATUS 0
Dec 13 00:23:49 compute03 mcelog: MCGCAP 1000c18 APICID 0 SOCKETID 0
Dec 13 00:23:49 compute03 mcelog: CPUID Vendor Intel Family 6 Model 47
Dec 13 00:25:38 compute03 kernel: mce: [Hardware Error]: Machine check events logged
Any help would be much appreciated.
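For reference, the STATUS value in the log can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 57 to 63, MCA error code in the low 16 bits, memory controller codes of the form 1MMM CCCC). The function and table names are made up for illustration and are not part of mcelog. It only shows how mcelog arrives at "Corrected error", "MCi_ADDR register valid" and "RD_CHANNELunspecified"; it does not map the error to a DIMM slot.

```python
# Hypothetical helper: decode the architectural bits of an MCi_STATUS value.
# Bit positions are taken from the Intel SDM machine-check chapter.
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# MMM field of a memory controller error code.
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status: int) -> dict:
    """Return the set flags and, for memory controller errors, op and channel."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {"flags": flags, "mca_code": mca_code}
    # Memory controller errors look like 000F 0000 1MMM CCCC (F = filter bit),
    # so mask out bit 12 and the MMM/CCCC fields before comparing.
    if mca_code & 0xEF80 == 0x0080:
        result["memory_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        # CCCC == 1111 means the channel was not specified by the hardware.
        result["channel"] = None if channel == 0xF else channel
    return result


# STATUS of the corrected error on CPU 0 BANK 1 from the log above.
decoded = decode_status(0x940000000000009F)
print(decoded)
```

A channel of `None` here corresponds to the "CHANNELunspecified" text in the mcelog output, which is one reason the event alone does not point at a particular DIMM.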


DELL-Josh Cr
Moderator
December 13th, 2017 14:00
Hi,
Do the hardware event logs show any errors on the system LCD, in the iDRAC, or in OpenManage Server Administrator (OMSA)?
kyle_jhu
December 14th, 2017 08:00
I've noticed that when the MCE logs show a large number of errors, OMSA/iDRAC will give me the correct DIMM. But when the errors are few, OMSA/iDRAC do not show any errors, and I have no clear way of figuring out which DIMM is having issues.
Is there an error threshold within iDRAC and OMSA? Basically, will they report every single-bit error, or does the count have to reach a certain number before iDRAC and OMSA report single-bit/multi-bit errors?
The MCE errors do not happen often, but with production systems I need to know when DIMMs are having problems. Sure, a certain number of errors is okay (interference from cosmic rays, power lines, etc.), but I'd truly like to be able to correlate the MCE errors with the specific DIMM that threw the error.
To answer your question: if there are a lot of errors in mcelog, then yes, OMSA/iDRAC notifies me.
DELL-Josh Cr
Moderator
December 14th, 2017 09:00
It reports errors once a threshold is reached, and then every error after that until the logs are cleared. Whichever DIMM the hardware logs show the error on, swap that DIMM to a different processor and see if the error moves with it. If the errors are infrequent, it may be hard to track down.
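While waiting for an infrequent error to cross the OMSA/iDRAC threshold, it can help to keep a running count of mcelog events per CPU and bank. The following is a minimal sketch, assuming the syslog line format shown in the first post. The names `tally` and `SAMPLE` are made up for illustration. It only counts events per (CPU, BANK, severity); it does not identify a DIMM.

```python
import re
from collections import Counter

# Patterns match the mcelog lines as they appear in the /var/log/messages excerpt.
CPU_BANK = re.compile(r"mcelog: CPU (\d+) BANK (\d+)")
SEVERITY = re.compile(r"mcelog: (Corrected|Uncorrected) error")


def tally(lines):
    """Count mcelog events keyed by (cpu, bank, severity)."""
    counts = Counter()
    current = None  # (cpu, bank) of the event currently being read
    for line in lines:
        match = CPU_BANK.search(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)))
            continue
        severity = SEVERITY.search(line)
        if severity and current is not None:
            counts[(current[0], current[1], severity.group(1).lower())] += 1
            current = None  # wait for the next "CPU n BANK m" line
    return counts


# A few lines copied from the log in the first post.
SAMPLE = [
    "Dec 13 00:23:49 compute03 mcelog: CPU 1 BANK 8",
    "Dec 13 00:23:49 compute03 mcelog: Uncorrected error",
    "Dec 13 00:23:49 compute03 mcelog: CPU 0 BANK 1 TSC e02613552c8d8",
    "Dec 13 00:23:49 compute03 mcelog: ADDR 28d2fa6300",
    "Dec 13 00:23:49 compute03 mcelog: Corrected error",
]

for key, count in sorted(tally(SAMPLE).items()):
    print(key, count)
```

On a live system the same function could be fed the lines of /var/log/messages, for example with `tally(open("/var/log/messages"))`, and the counts compared before and after a DIMM swap to see whether the errors follow the module.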