przemo.w

Openmanage server administrator not showing error

Hello,

The year has just begun and problems are already starting to pile up.

Last week I noticed an error, "Correctable memory error rate exceeded", for memory that I installed recently. I contacted the supplier and was advised to move the memory to another slot and clear the error logs. I cleared the logs first to see whether the problem would come back (especially when the weekly backup runs, since 7-Zip loves to use a lot of RAM).

This Monday I opened OMSA and saw a green check mark on "Main System Chassis". I thought, great, maybe a solar flare caused the memory errors last time, but to be sure I checked the Logs tab and found this error: "Correctable memory error rate exceeded for DIMM4". I went to System -> Main System Chassis -> Memory and everything there is green. Even the faulty memory in DIMM4 is shown as OK, although the logs say otherwise.

[Screenshot: memory OK despite logs]

And the logs tab:

[Screenshot: errors in logs]

So why doesn't OMSA show that the memory in DIMM4 is faulty (as it did last time, before I cleared the logs)?

Ah, I forgot the software information:

General Information
Dell OpenManage Systems Management Software (32-Bit)	Version 7.4.0

Details
Common Storage Module	Version 4.4.0
Data Engine	Version 7.4.0
Hardware Application Programming Interface	Version 7.4.0
Instrumentation Service	Version 7.4.0
Apache Tomcat Webserver	Version 7.0.39
Oracle Java Runtime Environment	Version 1.7.0_21
Server Administrator Core files	Version 7.4.0 (866)
OMACS	Version 7.4.0
Instrumentation Service Integration Layer	Version 7.4.0
OMINST	Version 7.4.0
Inventory Collector	Version 7.4.0
Storage Management	Version 4.4.0
Server Administrator Common Framework	Version 7.4.0
Operating System Logging	Version 7.4.0
Remote Access Controller Managed Node	Version 7.1.0
Remote Access Controller Data Populator	Version 7.1.0
Server Instrumentation SNMP Module	Version 7.4.0
Agent for Remote Access	Version 2.0.0
Server Instrumentation WMI Module	Version 7.4.0

Re: Openmanage server administrator not showing error

Hi,

A good first step is to swap DIMMs, as your supplier suggested, to see whether the error follows the DIMM. I would suggest swapping with DIMM1. You also need to clear the SBE (single-bit error) log: the hardware log picks up on the SBE log and will re-report the error if it is still in the memory log. Steps can be found at:

http://www.dell.com/support/article/sln131078/

I recommend making sure that the BIOS and BMC are current. Often there will be updates and fixes. The updates can be found at:

http://www.dell.com/support/home/product-support/product/poweredge-2950/

You can either run diagnostics on the memory to try to trigger errors, or monitor the server to see if the error comes back. If the error returns to DIMM4, the problem is most likely with the system board. If the error follows the DIMM to slot 1, the problem is with the DIMM.
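The swap test above boils down to a small decision rule. Here is a minimal sketch of it in Python; the function name and the returned strings are made up for illustration and are not part of OMSA or any Dell tool.

```python
from typing import Optional


def classify_memory_fault(original_slot: str,
                          swap_slot: str,
                          error_slot_after_swap: Optional[str]) -> str:
    """Interpret the outcome of a DIMM swap test.

    original_slot: slot that first reported the error (e.g. "DIMM4")
    swap_slot: slot the suspect module was moved to (e.g. "DIMM1")
    error_slot_after_swap: slot reporting the error afterwards,
        or None if the error has not come back.
    """
    if error_slot_after_swap is None:
        return "inconclusive: error has not recurred yet"
    if error_slot_after_swap == original_slot:
        # The error stayed with the slot, not the module.
        return "suspect system board"
    if error_slot_after_swap == swap_slot:
        # The error travelled with the module.
        return "suspect DIMM"
    return "inconclusive: error appeared in an unrelated slot"


# Module moved from DIMM4 to DIMM1, error now reported on DIMM1.
print(classify_memory_fault("DIMM4", "DIMM1", "DIMM1"))  # suspect DIMM
```

The "inconclusive" branches matter in practice: correctable errors can take weeks to reappear, so a clean log shortly after the swap does not clear the module.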

Jim Plumlee
Dell EMC, Enterprise Engineer

Get support on Twitter @DellCaresPRO

przemo.w

Re: Openmanage server administrator not showing error

Hello,

I swapped the modules and everything was OK for about a month... then the warning moved to DIMM5, so it looks like it went with the module.

But I did not check the logs afterwards, because the PE2950 display didn't show anything (steady blue, no warning or anything else). Yes, my bad...

A few days ago I noticed the display was orange again.

Now a bomb has been dropped on me: "E2119 Fatal SB Mem CRC" on the display, and in the logs a more dreadful description:

Multi-bit memory errors detected on a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.

My supplier advised me to run the server on one memory pair, and then check the other pairs in the first slots. If the error continues, then the motherboard is broken (worst-case scenario).
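The pair-by-pair check can be written down as a simple rule for reading the results. This is only a sketch: the pair labels and the function name are hypothetical, and the rule assumes each pair is tested alone in the first slots, as the supplier described.

```python
def interpret_pair_tests(errors_by_pair: dict) -> str:
    """Summarise pair-isolation results.

    errors_by_pair maps a pair label (hypothetical, e.g. "pair A") to
    True if memory errors appeared while only that pair was installed
    in the first slots, False otherwise.
    """
    if len(errors_by_pair) < 2:
        return "test at least two pairs before drawing conclusions"
    failing = sorted(label for label, had_errors in errors_by_pair.items()
                     if had_errors)
    if not failing:
        return "no fault reproduced"
    if len(failing) == len(errors_by_pair):
        # Errors persist regardless of which modules are installed.
        return "every pair fails: suspect motherboard"
    return "suspect modules in: " + ", ".join(failing)


print(interpret_pair_tests({"pair A": False, "pair B": True}))
```

If only some pairs fail, the modules in those pairs are the likely culprits. If every pair fails in the same slots, that points at the board, which is the worst case mentioned above.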

So for now I take backups every day and am looking for a moment to get the memory swapping done.

The BIOS I have (from OMSA):

Version	2.7.0
Release Date	10/30/2010

And the BMC:

Name	Baseboard Management Controller
Version	2.37.00

I found that the newest BIOS is v2.3.1 from 2013, but its version number is much lower than the one I currently have flashed, v2.7.0.
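To compare dotted version numbers like these reliably, compare them part by part as integers rather than as text. A minimal sketch, assuming plain numeric components (`parse_version` is an illustrative helper, not a Dell utility):

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as "2.7.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


installed = parse_version("2.7.0")  # BIOS reported by OMSA
offered = parse_version("2.3.1")    # BIOS found on the download page

# Tuples compare element by element, so (2, 7, 0) > (2, 3, 1).
print(installed > offered)  # True
```

By this comparison the installed 2.7.0 is newer than the offered 2.3.1, even though the offered package has the later date. The same helper handles the BMC version 2.37.00, which parses to (2, 37, 0).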

BMC is up to date: v2.37.00

P.S. I checked for updates using my server's service tag.
