Unsolved
1 Rookie
•
21 Posts
April 15th, 2024 22:44
Memory Channel Error Identification on PowerEdge R6515
Hello
I recently bought 16x Dell Part AA783423 as part of a memory upgrade, but one of the sticks seems bad.
I am experiencing memory errors on my Dell PowerEdge R6515 server and need help identifying the problematic DIMM slot. The edac-util tool reports errors specifically at "mc#0csrow#3channel#2":
edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#2: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#3: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#4: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#5: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#2: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#3: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#4: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#5: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#2: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#3: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#4: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#5: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#4: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#5: 0 Corrected Errors
I would like guidance on which DIMM slot corresponds to this particular memory channel. Could you provide me with the memory layout or any specific documentation that could help me isolate and address this issue? Any additional troubleshooting steps or advice would also be appreciated.
Thank you for your assistance.
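With this many counters, the one that matters is easy to miss. The following is a minimal Python sketch, not part of edac-utils, that filters `edac-util -v` output down to the non-zero per-channel counters. The function name `nonzero_counters` and the regular expression are illustrative assumptions based only on the line format shown in the output above; other edac-util versions may print a different layout.

```python
import re

# Matches per-channel lines in the format seen in this thread, e.g.
#   mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
# (assumed format; adjust if your edac-util prints something different)
LINE_RE = re.compile(
    r"^(?P<mc>mc\d+): (?P<csrow>csrow\d+): (?P<label>\S+): "
    r"(?P<count>\d+) (?P<kind>Corrected|Uncorrected) Errors$"
)


def nonzero_counters(edac_util_output):
    """Return (mc, csrow, label, kind, count) for each non-zero per-channel counter."""
    hits = []
    for raw in edac_util_output.splitlines():
        m = LINE_RE.match(raw.strip())
        if m and int(m["count"]) > 0:
            hits.append((m["mc"], m["csrow"], m["label"], m["kind"], int(m["count"])))
    return hits


# A few lines copied from the output in the post above.
SAMPLE = """\
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
"""

if __name__ == "__main__":
    for hit in nonzero_counters(SAMPLE):
        print(hit)
```

In practice the text would come from running `edac-util -v` and capturing its standard output; the sample is hard-coded here so the sketch runs anywhere.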
DELL-Joey C
Moderator
•
4.2K Posts
•
20.9K Points
April 16th, 2024 03:25
Hi,
It is hard to map an EDAC error message to a physical DIMM slot, because that requires the architectural schematics, which usually means getting engineering involved.
I would suggest disabling EDAC and letting the server's Lifecycle Controller capture the error; that would be an easier and faster way. These errors occur when the Error Detection and Correction (EDAC) module reads the registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once, and when EDAC is enabled it gets to them first.
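Before unloading EDAC, it can be useful to record what the kernel itself exposes. Below is a minimal Python sketch that reads the per-channel corrected-error counters and DIMM labels from sysfs. It assumes the legacy csrow layout (`mc*/csrow*/ch*_ce_count` and `ch*_dimm_label` under `/sys/devices/system/edac/mc`), which may not be present on every kernel build; the function name `read_channel_counters` is hypothetical. A label such as "mc#0csrow#3channel#2" appears to be a generic placeholder rather than a board-specific slot name, which would explain why it does not map directly to a silkscreen label.

```python
from pathlib import Path

# Standard EDAC sysfs root; the csrow/chN_* files below are the legacy layout
# and are an assumption about what this particular kernel exposes.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_channel_counters(root=EDAC_ROOT):
    """Return {(mc, csrow, channel): (ce_count, label)} for every channel found."""
    result = {}
    for ce_file in sorted(Path(root).glob("mc*/csrow*/ch*_ce_count")):
        csrow_dir = ce_file.parent
        channel = ce_file.name[: -len("_ce_count")]  # e.g. "ch2"
        label_file = csrow_dir / f"{channel}_dimm_label"
        # The label file may be missing; fall back to an empty string.
        label = label_file.read_text().strip() if label_file.exists() else ""
        key = (csrow_dir.parent.name, csrow_dir.name, channel)
        result[key] = (int(ce_file.read_text().strip()), label)
    return result


if __name__ == "__main__":
    for key, (count, label) in read_channel_counters().items():
        if count:
            print(key, count, label)
```

The `root` parameter exists only so the function can be pointed at a copy of the directory tree, for example one collected from the server before the modules were unloaded.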
SDeltaE
1 Rookie
•
21 Posts
April 16th, 2024 10:44
@DELL-Joey C Hello
I disabled EDAC with: rmmod amd64_edac edac_mce_amd
Now dmesg prints:
Checking the iDRAC lifecycle log, I'm not seeing anything picked up there.
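To confirm that the rmmod actually took effect, the list of loaded modules can be checked. This is a minimal Python sketch under the assumption that the module names are exactly the two used in the command above (they can differ between kernel versions); `loaded_edac_modules` is a hypothetical helper that takes the text of `/proc/modules`, where the first field of each line is the module name.

```python
def loaded_edac_modules(proc_modules_text, wanted=("amd64_edac", "edac_mce_amd")):
    """Return the names from `wanted` that still appear in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in wanted if name in loaded]


# Synthetic example text in /proc/modules format (not taken from the server
# in this thread): only edac_mce_amd is still loaded here.
EXAMPLE = """\
edac_mce_amd 32768 0 - Live 0x0000000000000000
ccp 118784 1 kvm_amd, Live 0x0000000000000000
"""

if __name__ == "__main__":
    print(loaded_edac_modules(EXAMPLE))
```

On the server itself the input would be the contents of `/proc/modules`; an empty result means neither module is loaded and EDAC is no longer consuming the error registers.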
DELL-Erman O
Moderator
•
3K Posts
•
14.9K Points
April 16th, 2024 12:37
Hello,
If there is nothing in the iDRAC Lifecycle Controller log, then it's hard to say there is a memory error. See the Resolution section of:
EDAC Errors in 'messages' Log in RedHat Enterprise Linux (RHEL) and PowerEdge | Dell
Hope that helps!
SDeltaE
1 Rookie
•
21 Posts
July 9th, 2024 08:19
@DELL-Erman O Hello. I've tried this, and dmesg -T no longer prints any memory-related errors, but iDRAC 9 doesn't show any errors either. I ran memtester for one pass, which took 26 hours, and nothing was picked up. I suspect that is because ECC is correcting the errors.
The system isn't stable :( As soon as I start it and RAM usage goes up, applications start freezing and dmesg -T shows this.
I removed all 16x Dell Part AA783423 that I had installed and went back to the old 16 x 32 GB sticks. With those, all the issues went away. I think these CPU lockups / freezes are caused by that one bad RAM stick that I still can't find!
Could you give me any other advice on how best to solve this, other than removing 1-2 sticks at a time until the issue is gone? I live far away from the DC, so that would be really hard for me.
DELL-Marco B
Moderator
•
4K Posts
July 9th, 2024 08:29
Hello,
This is very hard to diagnose, as the only way is to try to isolate the faulty memory module.
Also, of course, I suggest keeping the BIOS and iDRAC up to date.
Thanks
SDeltaE
1 Rookie
•
21 Posts
July 9th, 2024 08:51
@DELL-Marco B Yes :(
Is there any specific way that I should do it? Do you know of a very memory-intensive Linux command that would easily reproduce the issue? Or should I just go to the datacenter, plug in, say, 8x Dell Part AA783423, wait a day, and see if the issue comes back?
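Installing half of the suspect modules at a time is essentially a binary search, which keeps the number of datacenter visits low. The sketch below is a minimal Python illustration under three assumptions that may not hold here: exactly one module is faulty, the fault reproduces reliably within each test window, and every half-populated configuration is a valid population for the server. The slot names and the `fails` callback are hypothetical placeholders for "install only these modules and watch for the problem".

```python
import math


def max_rounds(n_dimms):
    """Worst-case number of test rounds to single out one bad DIMM by halving."""
    return math.ceil(math.log2(n_dimms)) if n_dimms > 1 else 0


def isolate(dimms, fails):
    """Halve the suspect set until one DIMM remains.

    `fails(subset)` is a hypothetical callback that returns True when the
    problem shows up with only `subset` of the suspect DIMMs installed.
    """
    suspects = list(dimms)
    rounds = 0
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        rounds += 1
        # If the installed half misbehaves, the bad DIMM is in it;
        # otherwise it must be in the half that was left out.
        suspects = half if fails(half) else suspects[len(half):]
    return suspects[0], rounds


if __name__ == "__main__":
    dimms = [f"DIMM{i}" for i in range(1, 17)]  # placeholder names, not real slot labels
    bad = "DIMM11"                              # pretend fault, for simulation only
    print(max_rounds(len(dimms)))
    print(isolate(dimms, lambda subset: bad in subset))
```

For 16 modules this comes to at most four test rounds instead of up to sixteen single-module swaps. If the fault is intermittent, each round needs a long enough soak under load before a "clean" result can be trusted.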
DELL-Marco B
Moderator
•
4K Posts
July 9th, 2024 10:23
Which CPU is installed?
These memory modules are not compatible with Skylake CPUs:
Dell 64GB Ram Memory Upgrade - DDR4; 3200MHz (Cascade Lake, Ice Lake & AMD CPU only) | Dell USA
SDeltaE
1 Rookie
•
21 Posts
July 9th, 2024 10:42
@DELL-Marco B
AMD EPYC 7R32
DELL-Marco B
Moderator
•
4K Posts
July 9th, 2024 16:57
You can use:
Running Stress Tests in SLI Support Live Image | Dell US