September 20th, 2016 04:00
PowerEdge R820 DIMM error mcelogs
Hello,
I have been facing a DIMM issue in one of our PowerEdge R820 servers.
The first logs were as follows:
Sep 6 08:53:59 r80-nd mcelog: STATUS 8c000041000800c0 MCGSTATUS 0
:Sep 6 08:53:59 r80-nd mcelog: MCGCAP 1000c14 APICID 40 SOCKETID 2
:Sep 6 08:53:59 r80-nd mcelog: CPUID Vendor Intel Family 6 Model 45
:Sep 6 09:00:19 r80-nd mcelog: Hardware event. This is not a software error.
:Sep 6 09:00:19 r80-nd mcelog: MCE 0
:Sep 6 09:00:19 r80-nd mcelog: CPU 2 BANK 8
:Sep 6 09:00:19 r80-nd mcelog: MISC 123180100010028c ADDR 926aaed000
:Sep 6 09:00:19 r80-nd mcelog: TIME 1473145219 Tue Sep 6 09:00:19 2016
:Sep 6 09:00:19 r80-nd mcelog: MCG status:
:Sep 6 09:00:19 r80-nd mcelog: MCi status:
:Sep 6 09:00:19 r80-nd mcelog: Corrected error
:Sep 6 09:00:19 r80-nd mcelog: MCi_MISC register valid
:Sep 6 09:00:19 r80-nd mcelog: MCi_ADDR register valid
:Sep 6 09:00:19 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
:Sep 6 09:00:19 r80-nd mcelog: Transaction: Memory scrubbing error
:Sep 6 09:00:19 r80-nd mcelog: MemCtrl: Corrected patrol scrub error
:Sep 6 09:00:19 r80-nd mcelog:
:Sep 6 09:00:19 r80-nd mcelog: STATUS 8c000041000800c0 MCGSTATUS 0
:Sep 6 09:00:19 r80-nd mcelog: MCGCAP 1000c14 APICID 40 SOCKETID 2
:Sep 6 09:00:19 r80-nd mcelog: CPUID Vendor Intel Family 6 Model 45
I then re-seated all the DIMMs in the server, and the errors were gone for about two weeks. Now the errors are back, but in a different location, as shown below:
:Sep 19 19:46:47 r80-nd mcelog: Hardware event. This is not a software error.
:Sep 19 19:46:47 r80-nd mcelog: MCE 0
:Sep 19 19:46:47 r80-nd mcelog: CPU 3 BANK 9
:Sep 19 19:46:47 r80-nd mcelog: MISC 918c2000240188c ADDR bbc0977000
:Sep 19 19:46:47 r80-nd mcelog: TIME 1474307207 Mon Sep 19 19:46:47 2016
:Sep 19 19:46:47 r80-nd mcelog: MCG status:
:Sep 19 19:46:47 r80-nd mcelog: MCi status:
:Sep 19 19:46:47 r80-nd mcelog: Error overflow
:Sep 19 19:46:47 r80-nd mcelog: Corrected error
:Sep 19 19:46:47 r80-nd mcelog: MCi_MISC register valid
:Sep 19 19:46:47 r80-nd mcelog: MCi_ADDR register valid
:Sep 19 19:46:47 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
:Sep 19 19:46:47 r80-nd mcelog: Transaction: Memory scrubbing error
:Sep 19 19:46:47 r80-nd mcelog: MemCtrl: Corrected patrol scrub error
:Sep 19 19:46:47 r80-nd mcelog:
:Sep 19 19:46:47 r80-nd mcelog: STATUS cc001a4c000800c1 MCGSTATUS 0
:Sep 19 19:46:47 r80-nd mcelog: MCGCAP 1000c14 APICID 60 SOCKETID 3
:Sep 19 19:46:47 r80-nd mcelog: CPUID Vendor Intel Family 6 Model 45
:Sep 19 19:46:47 r80-nd mcelog: Fallback Socket memory error count 104 exceeded threshold: 522 in 24h
:Sep 19 19:46:47 r80-nd mcelog: Location: SOCKET:3 CHANNEL:? DIMM:? []
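For reference, the two STATUS values in the excerpts above can be decoded by hand. This is only a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (flag bits 57 to 63, MCA error code in the low 16 bits) and assuming the corrected-error count field in bits 52:38 is populated on this platform. The name `decode_status` and the dictionary keys are made up for illustration and are not part of mcelog.

```python
# Flag bits of IA32_MCi_STATUS (assumed architectural layout, Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow: an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Split a raw MCi_STATUS value into the fields mcelog prints."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "corrected": "UC" not in flags,
        # Bits 52:38, assumed to hold the corrected error count.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        result["mem_op"] = (mca_code >> 4) & 0x7  # 4 = memory scrubbing
        result["channel"] = mca_code & 0xF        # 0xF = channel not specified
    return result


# The two STATUS values from the logs in this post.
for raw in (0x8C000041000800C0, 0xCC001A4C000800C1):
    print(hex(raw), decode_status(raw))
```

Decoded this way, both values come out as corrected (UC clear) memory scrubbing errors, the first on channel 0 and the second on channel 1 with the overflow bit set, which matches the text mcelog printed. The channel here is relative to the memory controller of the reporting socket. It does not by itself name a DIMM slot.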
So I am now fairly sure that one of the DIMMs is faulty, right?
I need help identifying which DIMM is the faulty one and needs to be removed.
The following errors are also appearing on the server's front-panel LCD:
MEM0702 correctable memory error rate exceeded for DIMM_D2. Reseat memory
MEM0005 persistent correctable memory error limit reached for DIMM1,DIMM2,DIMM3,DIMM4
Appreciate your help!


Ramy Adly
September 21st, 2016 09:00
Hello,
Thanks for your reply.
I have already re-seated all the DIMMs, and the errors are now far less frequent: I used to get an error report every 5 minutes, but since the re-seating the server has reported only 7 errors in 2 weeks.
However, the errors still occur.
I need your help to identify which DIMM is affected.
I can find in the logs the following line:
CPU 3 BANK 9
I need to know where this DIMM is physically located in the server. I can't find anything in the manual about banks or how they map to the physical DIMM slots.
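To make the comparison between events easier, here is a rough Python sketch that pulls the CPU, BANK, SOCKETID, ADDR and channel out of each mcelog record. It assumes every record starts with a "Hardware event" line, as in the logs posted above. The names `parse_events`, `FIELDS` and `SAMPLE` are hypothetical. Note that BANK in mcelog output is the number of the machine-check register bank that reported the error, not a DIMM bank, so the script only groups events and does not try to name a physical slot.

```python
import re

# A few lines copied from the logs posted in this thread.
SAMPLE = """\
:Sep 6 09:00:19 r80-nd mcelog: Hardware event. This is not a software error.
:Sep 6 09:00:19 r80-nd mcelog: CPU 2 BANK 8
:Sep 6 09:00:19 r80-nd mcelog: MISC 123180100010028c ADDR 926aaed000
:Sep 6 09:00:19 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
:Sep 6 09:00:19 r80-nd mcelog: MCGCAP 1000c14 APICID 40 SOCKETID 2
:Sep 19 19:46:47 r80-nd mcelog: Hardware event. This is not a software error.
:Sep 19 19:46:47 r80-nd mcelog: CPU 3 BANK 9
:Sep 19 19:46:47 r80-nd mcelog: MISC 918c2000240188c ADDR bbc0977000
:Sep 19 19:46:47 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
:Sep 19 19:46:47 r80-nd mcelog: MCGCAP 1000c14 APICID 60 SOCKETID 3
"""

# Field name -> regex with one capture group (values kept as strings).
FIELDS = {
    "cpu": r"\bCPU (\d+)",
    "bank": r"\bBANK (\d+)",
    "addr": r"\bADDR ([0-9a-f]+)",
    "socket": r"\bSOCKETID (\d+)",
    "channel": r"MS_CHANNEL(\d+)_ERR",
}


def parse_events(text: str) -> list:
    """Group mcelog syslog lines into one dict per hardware event."""
    events = []
    current = None
    for line in text.splitlines():
        # Keep only the payload after the "mcelog:" syslog tag.
        _, sep, msg = line.partition("mcelog:")
        if not sep:
            continue
        msg = msg.strip()
        if msg.startswith("Hardware event"):
            current = {}
            events.append(current)
            continue
        if current is None:
            continue  # trailing lines of a record whose start was cut off
        for key, pattern in FIELDS.items():
            match = re.search(pattern, msg)
            if match:
                current[key] = match.group(1)
    return events


for event in parse_events(SAMPLE):
    print(event)
```

In the two events posted so far, the bank number moves together with the channel reported in the MCA line (bank 8 with MS_CHANNEL0_ERR, bank 9 with MS_CHANNEL1_ERR), and CPU matches SOCKETID. Which slot labels sit on a given socket and channel still has to come from the server's memory population documentation.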
Appreciate your help