The year has just begun and problems are already starting to pile up.
Last week I noticed an error, "Correctable memory error rate exceeded", for memory that I installed recently. I contacted the supplier and was advised to move the memory to another slot and clear the error logs... I cleared the logs first to see if the problem would come back (especially when the weekly backup runs, since 7-Zip loves to use a lot of RAM).
This Monday I opened OMSA and saw a green tick mark next to "Main System Chassis". I thought, great, "maybe a solar flare caused the memory errors last time", but to be sure I checked the Logs tab, and there was the error: "Correctable memory error rate exceeded for DIMM4". I went to System -> Main System Chassis -> Memory and everything there is "GREEN"; even the faulty memory in DIMM4 is shown as OK, but the logs say otherwise?
And the logs tab:
So why doesn't OMSA show that the memory in DIMM4 is faulty (as it did last time, before I cleared the logs)?
Ah, I forgot the software information:
General Information
Dell OpenManage Systems Management Software (32-Bit) Version 7.4.0

Details
Common Storage Module Version 4.4.0
Data Engine Version 7.4.0
Hardware Application Programming Interface Version 7.4.0
Instrumentation Service Version 7.4.0
Apache Tomcat Webserver Version 7.0.39
Oracle Java Runtime Environment Version 1.7.0_21
Server Administrator Core files Version 7.4.0 (866)
OMACS Version 7.4.0
Instrumentation Service Integration Layer Version 7.4.0
OMINST Version 7.4.0
Inventory Collector Version 7.4.0
Storage Management Version 4.4.0
Server Administrator Common Framework Version 7.4.0
Operating System Logging Version 7.4.0
Remote Access Controller Managed Node Version 7.1.0
Remote Access Controller Data Populator Version 7.1.0
Server Instrumentation SNMP Module Version 7.4.0
Agent for Remote Access Version 2.0.0
Server Instrumentation WMI Module Version 7.4.0
A good first step is to swap DIMMs, as suggested by your supplier, to see whether the error follows the DIMM. I would suggest swapping with DIMM1. You also need to clear the SBE log: the hardware log picks up on the SBE log and will re-report the error if it is still in the memory log. Steps can be found at:
I recommend making sure that the BIOS and BMC are current, since updates often include fixes. The updates can be found at:
You can either run diagnostics on the memory to try to trigger errors, or monitor the server to see if the error comes back. If the error returns on DIMM4, the problem is most likely with the system board. If the error follows the DIMM to slot 1, the problem is with the DIMM.
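As a rough sketch of the swap-test reasoning above, here it is written out in Python. The function and argument names are hypothetical and only for illustration; this is not part of any Dell tool, and the result is a likely cause, not a certain one.

```python
def locate_fault(original_slot, swapped_to_slot, error_slot_after_swap):
    """Interpret the result of moving a suspect DIMM to another slot.

    original_slot: slot that first reported the error (e.g. "DIMM4")
    swapped_to_slot: slot the suspect module was moved to (e.g. "DIMM1")
    error_slot_after_swap: slot reporting the error after the swap,
        or None if no error has come back yet
    """
    if error_slot_after_swap is None:
        # Nothing reported yet: keep monitoring or run memory diagnostics.
        return "inconclusive"
    if error_slot_after_swap == original_slot:
        # The error stayed with the slot, so the slot/system board is suspect.
        return "system board"
    if error_slot_after_swap == swapped_to_slot:
        # The error followed the module, so the module is suspect.
        return "DIMM"
    # The error appeared somewhere unrelated to the swap.
    return "inconclusive"


print(locate_fault("DIMM4", "DIMM1", "DIMM1"))
```

With the values shown, the error followed the module to slot 1, so the call reports the DIMM as the likely cause.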
Dell EMC, Enterprise Engineer
Get support on Twitter @DellCaresPRO
I swapped the modules and everything was OK for about a month... then the warning moved to DIMM5, so it looks like it went with the module.
But I did not check the logs afterwards because the PE2950 display didn't show anything (steady blue, no warning or anything else...). Yes, my bad...
A few days ago I noticed the display was orange again....
Now I've had a bomb dropped on me: "E2119 Fatal SB Mem CRC" on the display, and in the logs an even more dreadful description:
Multi-bit memory errors detected on a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.
My supplier advised me to run the server on one memory pair, and then check the other pairs in the first slots. If the error continues, then the motherboard is broken (worst-case scenario).
So for now I take backups every day and am looking for a moment to get the memory swapping done.
The BIOS I have (from OMSA):
Version 2.7.0 Release Date 10/30/2010
And the BMC:
Name Baseboard Management Controller Version 2.37.00
I found that the newest BIOS, v2.3.1, is from 2013, but its version number is much lower than the one I currently have flashed, which is v2.7.0.
BMC is up to date: v2.37.00
P.S. I checked for updates using the server's service tag.
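To make the BIOS version comparison above explicit, here is a minimal Python sketch that compares dotted version strings numerically. The helper names are hypothetical; the two version strings are the ones quoted in this post. It only compares the numbers and cannot tell whether the listed download belongs to the same BIOS line as the installed one.

```python
def parse_version(text):
    """Turn a string like 'v2.7.0' into a tuple of ints, e.g. (2, 7, 0)."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def is_newer(candidate, installed):
    """Return True only if candidate is numerically higher than installed."""
    return parse_version(candidate) > parse_version(installed)


installed_bios = "v2.7.0"  # reported by OMSA, release date 10/30/2010
listed_bios = "v2.3.1"     # listed as the newest download, from 2013

print(is_newer(listed_bios, installed_bios))  # False
```

By version number alone, the listed v2.3.1 is not an upgrade over the installed v2.7.0, despite its later date.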