September 20th, 2016 04:00

PowerEdge R820 DIMM error mcelogs

Hello,

I have been facing a DIMM issue in one of our PowerEdge R820 servers.

The first logs were as follows:

Sep  6 08:53:59 r80-nd mcelog: STATUS 8c000041000800c0 MCGSTATUS 0
:Sep  6 08:53:59 r80-nd mcelog: MCGCAP 1000c14 APICID 40 SOCKETID 2
:Sep  6 08:53:59 r80-nd mcelog: CPUID Vendor Intel Family 6 Model 45
:Sep  6 09:00:19 r80-nd mcelog: Hardware event. This is not a software error.
:Sep  6 09:00:19 r80-nd mcelog: MCE 0
:Sep  6 09:00:19 r80-nd mcelog: CPU 2 BANK 8
:Sep  6 09:00:19 r80-nd mcelog: MISC 123180100010028c ADDR 926aaed000
:Sep  6 09:00:19 r80-nd mcelog: TIME 1473145219 Tue Sep  6 09:00:19 2016
:Sep  6 09:00:19 r80-nd mcelog: MCG status:
:Sep  6 09:00:19 r80-nd mcelog: MCi status:
:Sep  6 09:00:19 r80-nd mcelog: Corrected error
:Sep  6 09:00:19 r80-nd mcelog: MCi_MISC register valid
:Sep  6 09:00:19 r80-nd mcelog: MCi_ADDR register valid
:Sep  6 09:00:19 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
:Sep  6 09:00:19 r80-nd mcelog: Transaction: Memory scrubbing error
:Sep  6 09:00:19 r80-nd mcelog: MemCtrl: Corrected patrol scrub error
:Sep  6 09:00:19 r80-nd mcelog:
:Sep  6 09:00:19 r80-nd mcelog: STATUS 8c000041000800c0 MCGSTATUS 0
:Sep  6 09:00:19 r80-nd mcelog: MCGCAP 1000c14 APICID 40 SOCKETID 2
:Sep  6 09:00:19 r80-nd mcelog: CPUID Vendor Intel Family 6 Model 45

I then re-seated all the DIMMs in the server, and the errors disappeared for about two weeks. Now they are back, but at a different location, as shown below:

:Sep 19 19:46:47 r80-nd mcelog: Hardware event. This is not a software error.
:Sep 19 19:46:47 r80-nd mcelog: MCE 0
:Sep 19 19:46:47 r80-nd mcelog: CPU 3 BANK 9
:Sep 19 19:46:47 r80-nd mcelog: MISC 918c2000240188c ADDR bbc0977000
:Sep 19 19:46:47 r80-nd mcelog: TIME 1474307207 Mon Sep 19 19:46:47 2016
:Sep 19 19:46:47 r80-nd mcelog: MCG status:
:Sep 19 19:46:47 r80-nd mcelog: MCi status:
:Sep 19 19:46:47 r80-nd mcelog: Error overflow
:Sep 19 19:46:47 r80-nd mcelog: Corrected error
:Sep 19 19:46:47 r80-nd mcelog: MCi_MISC register valid
:Sep 19 19:46:47 r80-nd mcelog: MCi_ADDR register valid
:Sep 19 19:46:47 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
:Sep 19 19:46:47 r80-nd mcelog: Transaction: Memory scrubbing error
:Sep 19 19:46:47 r80-nd mcelog: MemCtrl: Corrected patrol scrub error
:Sep 19 19:46:47 r80-nd mcelog:
:Sep 19 19:46:47 r80-nd mcelog: STATUS cc001a4c000800c1 MCGSTATUS 0
:Sep 19 19:46:47 r80-nd mcelog: MCGCAP 1000c14 APICID 60 SOCKETID 3
:Sep 19 19:46:47 r80-nd mcelog: CPUID Vendor Intel Family 6 Model 45
:Sep 19 19:46:47 r80-nd mcelog: Fallback Socket memory error count 104 exceeded threshold: 522 in 24h
:Sep 19 19:46:47 r80-nd mcelog: Location: SOCKET:3 CHANNEL:? DIMM:? []
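
For reference, the STATUS values in these records can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (valid, overflow, uncorrected, MISC/ADDR valid flags, corrected error count in bits 52:38, and the compound memory-controller error code `0000 0000 1MMM CCCC` in the low 16 bits). The function name `decode_mci_status` and the `MEM_OPS` table are illustrative only and are not part of mcelog.

```python
# Memory-controller request types (the MMM field of the MCA error code).
MEM_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into a few architectural fields."""
    mca_code = status & 0xFFFF  # bits 15:0, MCA error code
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Bits 52:38 hold the corrected error count when the CPU supports it.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["mem_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        # CCCC == 1111 means the channel is not specified.
        fields["channel"] = None if channel == 0xF else channel
    return fields


first = decode_mci_status(0x8C000041000800C0)
second = decode_mci_status(0xCC001A4C000800C1)
print(first["overflow"], first["mem_op"], first["channel"])
print(second["overflow"], second["mem_op"], second["channel"])
```

Under those assumptions, the first STATUS decodes as a corrected scrubbing error on channel 0 without overflow, and the second as a corrected scrubbing error on channel 1 with the overflow bit set, which agrees with the "Error overflow" and MS_CHANNEL1_ERR lines mcelog printed. The decode gives socket-relative channel information only; it does not by itself name a physical DIMM slot.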


So I am now fairly sure that one of the DIMMs is faulty, right?

I need help identifying which DIMM is the faulty one and needs to be removed.

The following errors are also appearing on the server's front-panel LCD:

MEM0702 correctable memory error rate exceeded for DIMM_D2. Reseat memory
MEM0005 persistent correctable memory error limit reached for DIMM1,DIMM2,DIMM3,DIMM4

Appreciate your help!

September 21st, 2016 09:00

Hello,

Thanks for your reply.

I have already re-seated all the DIMMs, and the errors are now far less frequent: I used to get an error report every 5 minutes, but since the re-seating the server has reported only 7 errors in 2 weeks.

However, the error still exists.

I need your help to identify which DIMM is affected.

I can find in the logs the following line:

 CPU 3 BANK 9

I need to know where this DIMM is physically located in the server. I can't find anything in the manual about banks or how they map to the physical DIMM slots.
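
To narrow this down, the socket, bank, and channel of each event can be pulled out of the mcelog output automatically. Below is a rough sketch that assumes the log format shown in this thread; `summarize_events` and `SAMPLE` are hypothetical names made up for illustration. It only reports what mcelog already printed (in these records BANK 8 appears with CHANNEL0 and BANK 9 with CHANNEL1) and does not translate that into a slot label; that mapping still has to come from the system documentation or the LCD/iDRAC messages.

```python
import re

# A few lines copied from the second record in this thread.
SAMPLE = """\
:Sep 19 19:46:47 r80-nd mcelog: Hardware event. This is not a software error.
:Sep 19 19:46:47 r80-nd mcelog: CPU 3 BANK 9
:Sep 19 19:46:47 r80-nd mcelog: MISC 918c2000240188c ADDR bbc0977000
:Sep 19 19:46:47 r80-nd mcelog: MCi_ADDR register valid
:Sep 19 19:46:47 r80-nd mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
:Sep 19 19:46:47 r80-nd mcelog: MCGCAP 1000c14 APICID 60 SOCKETID 3
"""


def summarize_events(log_text: str) -> list:
    """Collect CPU, bank, address, channel and socket for each mcelog record."""
    events = []
    current = None
    for line in log_text.splitlines():
        # Each record starts with the "Hardware event" banner.
        if "Hardware event" in line:
            current = {}
            events.append(current)
            continue
        if current is None:
            continue  # lines from a record whose start was cut off
        m = re.search(r"\bCPU (\d+) BANK (\d+)", line)
        if m:
            current["cpu"], current["bank"] = int(m.group(1)), int(m.group(2))
        m = re.search(r"\bADDR ([0-9a-f]+)\b", line)
        if m:
            current["addr"] = int(m.group(1), 16)
        m = re.search(r"MS_CHANNEL(\d+)_ERR", line)
        if m:
            current["channel"] = int(m.group(1))
        m = re.search(r"\bSOCKETID (\d+)", line)
        if m:
            current["socket"] = int(m.group(1))
    return events


events = summarize_events(SAMPLE)
for event in events:
    print(event)
```

Running this over the full log and grouping by socket and channel would show whether the errors keep landing on one channel of one socket or move around after a re-seat.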

Appreciate your help
