Steps on how to confirm and troubleshoot DIMM errors on a Cisco C-Series Server

Summary: Steps on how to confirm and troubleshoot DIMM errors on a Cisco C-Series Server


Instructions

How to Clear DIMM Errors for UCS C-Series Server

Duration: 00:01:29 (hh:mm:ss)

Facts

  • Cisco C-Series Rack Mounted Servers (May or may not be managed by UCSM)


Symptoms

  •  Alerts will show up in CIMC or UCSM, such as:

F0184 
F0185
F0137
F1236
F1237

  • PSOD – Purple Screen of Death (on the KVM or host console)


Solution

Log collection
 
Capture logs from the affected server BEFORE any troubleshooting is done. We need a baseline to determine the success of troubleshooting steps.

C-Series Rack servers can either be standalone or managed by UCSM.  The steps to gather and review the logs will be slightly different depending on which it is.

  • Standalone.
  • Managed by UCSM - Select “Rack Mount” instead of “chassis” or “ucsm” in the Options field
  • If you have only CIMC logs, you can tell that they come from a UCSM-managed server because the file name contains CIMCXXX and the log files sit in a zipped directory called Server XX rather than directly in the main zipped directory. In that case, UCSM logs are required as well.

If the server experienced a PSOD, take a screenshot of the PSOD and gather the vSphere/host logs.

Log analysis

The main differences between standalone and UCSM-managed logs are:

  • Additional information is available in the UCSM sam_techsupport file for UCSM managed servers
  • Location of the directories. (see note under log collection)

Helpful log locations in UCSM and CIMC logs:

UCSM_X_TechSupport.tar\sam_techsupportinfo

  • `show server inventory expand` (confirm server serial number, locate PID). Example:
Server 1:
     Model: UCSC-C220-M4S
     Acknowledged Serial (SN): FCHXXXXXXXXXX
     Acknowledged Product Name: Cisco UCS C220 M4S
     Acknowledged PID: UCSC-C220-M4S
  • `show fault detail` (locate the associated faults). Example:
Severity: Major
Code: F0844
Last Transition Time: 2017-05-23T12:40:40.774
Description: DIMM DIMM_B2 on server 24 operaState: disabled
  • `show server memory detail` (locate the impacted DIMM PID). Example:
Location: DIMM_A1
Product Name: 16GB DDR4-2400-MHz RDIMM/PC4-19200/single rank/x4/1.2v
PID: UCS-MR-xxxxxxxx-A
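
As a rough illustration, the fault records shown above could be filtered for DIMM-related entries with a short Python script. This is only a sketch: it assumes each fault record is a run of `Key: Value` lines that starts with a `Severity:` line, as in the example output, and the function names `parse_fault_records` and `dimm_faults` are hypothetical helpers, not part of any Cisco or Dell tool.

```python
import re

def parse_fault_records(text):
    """Split `show fault detail`-style text into dicts, one per fault.

    Assumption: every record starts with a "Severity:" line and consists of
    "Key: Value" lines, as in the example output above.
    """
    records, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: Value" line
        key, value = key.strip(), value.strip()
        if key == "Severity":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records

def dimm_faults(records):
    """Keep only faults whose description names a DIMM slot (DIMM_xx)."""
    return [r for r in records
            if re.search(r"\bDIMM_\w+", r.get("Description", ""))]

# Sample text copied from the example in this article.
sample = """\
Severity: Major
Code: F0844
Last Transition Time: 2017-05-23T12:40:40.774
Description: DIMM DIMM_B2 on server 24 operaState: disabled
"""

for fault in dimm_faults(parse_fault_records(sample)):
    print(fault["Code"], fault["Severity"], fault["Description"])
```

Splitting on the first colon only keeps timestamps and descriptions that contain further colons intact.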

Note – most of this information is available in sam_techsupport for UCSM-managed servers

[ServerXX_TechSupport.tar]\tmp\ServerXX_TechSupport.txt

  • Chassis Info Area

Find the server serial number, listed as “Chassis Serial Num”. Example as follows:

====================[  Chassis Info Area  ]======================
            Chassis Part Num                  : [74-xxxxx-02]
            Chassis Serial Num                : [FCHXXXXXXXXX]

 

  • Board Area

Find Motherboard PID and serial number. Example as follows: 

========================[  Board Area  ]=========================
            Board Product Name                : [UCSC-C240-Mxxxx]
            Board Serial Number               : [FCHXXXXXXXX]

 

  • SMBIOS Table Dump BEGIN    

Find the DIMM part number under Memory Device\Part Number. Example as follows:
Note: this may not be the Cisco PID, but it can be correlated to find it.

Memory Device
           Locator: DIMM_A1
           Part Number: 36ASxxxxxx-2G3B1

  Querying All IPMI Sensors section:

Correctable and uncorrectable Errors:
Sensor Name      | Reading   | Unit  | Status | LNR | LC | LNC | UNC | UC | UNR
DDR4_P2_E1_ECC   | 63250.000 | error | UNR    | na  | na | na  | na  | na | 60250.000
DDR4_P2_E2_ECC   | 63750.000 | error | UNR    | na  | na | na  | na  | na | 60250.000
DDR4_P2_E3_ECC   | 63250.000 | error | UNR    | na  | na | na  | na  | na | 60250.000
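
When the sensor section is long, a small script can help pick out the ECC sensors that are in a threshold state. The sketch below is illustrative only: it assumes one sensor per line with `|`-separated fields and a header row, and it assumes that a Status value equal to one of the threshold column names (LNR through UNR) means that threshold was crossed. The function names are hypothetical.

```python
# Status values treated as "threshold crossed". Assumption: they mirror the
# threshold column names in the table header (LNR ... UNR).
THRESHOLD_STATES = {"LNR", "LC", "LNC", "UNC", "UC", "UNR"}

def parse_sensor_rows(text):
    """Parse pipe-delimited sensor rows into dicts keyed by the header names.

    Assumption: the first line containing "|" is the header row.
    """
    lines = [ln for ln in text.splitlines() if "|" in ln]
    if not lines:
        return []
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        fields = [f.strip() for f in ln.split("|")]
        if len(fields) == len(header):  # skip malformed or fused rows
            rows.append(dict(zip(header, fields)))
    return rows

def flagged_ecc_sensors(rows):
    """Return the names of *_ECC sensors whose Status is a threshold state."""
    return [r["Sensor Name"] for r in rows
            if r["Sensor Name"].endswith("_ECC")
            and r["Status"] in THRESHOLD_STATES]

# Sample rows copied from the example in this article.
sample = """\
Sensor Name      | Reading   | Unit  | Status | LNR | LC | LNC | UNC | UC | UNR
DDR4_P2_E1_ECC   | 63250.000 | error | UNR    | na  | na | na  | na  | na | 60250.000
DDR4_P2_E2_ECC   | 63750.000 | error | UNR    | na  | na | na  | na  | na | 60250.000
DDR4_P2_E3_ECC   | 63250.000 | error | UNR    | na  | na | na  | na  | na | 60250.000
"""

print(flagged_ecc_sensors(parse_sensor_rows(sample)))
```

Rows whose field count does not match the header are skipped rather than guessed at, so a damaged line cannot shift values into the wrong columns.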



[ServerXX_TechSupport.tar]\var\log\sel\log

  • Review the logs for any Correctable and Uncorrectable ECC Errors:
Memory DDR4_P2_E2_ECC #0xb0 | read 512 correctable ECC errors on CPU2 DIMM E2  | Asserted
  • Review the logs for any CATERR_N … Asserted | Asserted entries. For example:
03/06/2017 20:02:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted
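
The two checks above can be sketched as a simple scan over the SEL text. This is a minimal example under stated assumptions: the ECC pattern is modeled on the single correctable-error line shown above (the wording of uncorrectable entries is assumed to mirror it), CATERR_N lines containing "de-asserted" are skipped, and `scan_sel` is a hypothetical helper name.

```python
import re

# Pattern modeled on the correctable-ECC example line in this article.
# Assumption: uncorrectable entries use the same wording.
ECC_RE = re.compile(
    r"read (\d+) (correctable|uncorrectable) ECC errors on (CPU\d+) DIMM (\w+)"
)

def scan_sel(lines):
    """Return (ecc_events, caterr_lines) found in an iterable of SEL lines."""
    ecc_events, caterr_lines = [], []
    for line in lines:
        match = ECC_RE.search(line)
        if match:
            ecc_events.append({
                "count": int(match.group(1)),
                "kind": match.group(2),
                "cpu": match.group(3),
                "dimm": match.group(4),
            })
        elif "CATERR_N" in line and "de-asserted" not in line.lower():
            # Keep CATERR_N entries, but skip the de-asserted ones.
            caterr_lines.append(line.strip())
    return ecc_events, caterr_lines

# Sample lines copied from the examples in this article.
sample = [
    "Memory DDR4_P2_E2_ECC #0xb0 | read 512 correctable ECC errors on CPU2 DIMM E2  | Asserted",
    "03/06/2017 20:02:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted",
]

events, caterr = scan_sel(sample)
print(events)
print(caterr)
```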

  Note: it is expected behavior to see CATERR_N de-asserted | Asserted in the logs at boot time.

[ServerXX_TechSupport.tar]\var\DIMM-BL_Status.txt

  • Find the correctable/uncorrectable error counts for the impacted DIMM(s) and copy the relevant fields. For example:
================== SUMMARY OF DIMM ERRORS ===================
------- DIMM  E2 ----------
  CURRENT SLOT ERROR COUNTS :
      Correctable ECC Errors Since Last Server Boot   : 0
      Cummulative Correctable ECC Error Count         : 2560
      Uncorrectable ECC Errors Since Last Server Boot : 0
      Cummulative Uncorrectable ECC Error Count       : 3
   PREVIOUS SLOT ERROR COUNTS :
      Correctable ECC Error Count         : 0
      Uncorrectable ECC Error Count       : 0
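
To copy the relevant fields consistently, the summary block can be turned into a nested dictionary. The following is a sketch only: it assumes the layout shown above (a dashed `DIMM <name>` line, then `CURRENT`/`PREVIOUS SLOT ERROR COUNTS` sections with `label : count` lines) and keeps the labels exactly as they appear in the file, including their spelling. `parse_dimm_summary` is a hypothetical name.

```python
import re

def parse_dimm_summary(text):
    """Return {dimm: {"current" | "previous": {label: count}}}.

    Assumption: the text follows the layout of the example in this article.
    """
    result = {}
    dimm = section = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"-+\s*DIMM\s+(\w+)\s*-+", line)
        if m:  # e.g. "------- DIMM  E2 ----------"
            dimm, section = m.group(1), None
            result[dimm] = {}
            continue
        m = re.fullmatch(r"(CURRENT|PREVIOUS) SLOT ERROR COUNTS\s*:", line)
        if m and dimm:
            section = m.group(1).lower()
            result[dimm][section] = {}
            continue
        m = re.fullmatch(r"(.+?)\s*:\s*(\d+)", line)
        if m and dimm and section:
            result[dimm][section][m.group(1)] = int(m.group(2))
    return result

# Sample text copied from the example in this article.
sample = """\
================== SUMMARY OF DIMM ERRORS ===================
------- DIMM  E2 ----------
  CURRENT SLOT ERROR COUNTS :
      Correctable ECC Errors Since Last Server Boot   : 0
      Cummulative Correctable ECC Error Count         : 2560
      Uncorrectable ECC Errors Since Last Server Boot : 0
      Cummulative Uncorrectable ECC Error Count       : 3
   PREVIOUS SLOT ERROR COUNTS :
      Correctable ECC Error Count         : 0
      Uncorrectable ECC Error Count       : 0
"""

for name, sections in parse_dimm_summary(sample).items():
    for section, counts in sections.items():
        for label, count in counts.items():
            print(name, section, label, count)
```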


[ServerXX_TechSupport.tar]\var\sel_decode.txt

  • Chronological, entry-by-entry record of SEL entries and faults
eventLogMaxEntries: 1445
eventLogList: 
---
Id: 1440
severity: Critical
dateTime: 2017-03-10 00:57:17 
dateTimeOrder: 00005
description: "System Software event: Post sensor, DIMM socket 3, Channel E, Processor Socket 2. disabled due to other memory failed in same channel. [0xE542] was asserted"
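
For a long sel_decode.txt, it can help to list only the critical entries and pull the DIMM location out of the description. This sketch assumes each entry begins with an `Id:` line followed by `key: value` lines, as in the excerpt above, and that the location wording matches the example description. The names `parse_sel_decode` and `LOCATION_RE` are hypothetical.

```python
import re

# Modeled on the description text in the example entry above.
LOCATION_RE = re.compile(
    r"DIMM socket (\d+), Channel (\w), Processor Socket (\d+)"
)

def parse_sel_decode(text):
    """Split sel_decode-style text into one dict per entry.

    Assumption: each entry starts with an "Id:" line.
    """
    entries, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # separator lines such as "---"
        key, value = key.strip(), value.strip().strip('"')
        if key == "Id":
            current = {}
            entries.append(current)
        if current is not None:
            current[key] = value
    return entries

# Sample text copied from the example in this article.
sample = '''\
eventLogMaxEntries: 1445
eventLogList:
---
Id: 1440
severity: Critical
dateTime: 2017-03-10 00:57:17
dateTimeOrder: 00005
description: "System Software event: Post sensor, DIMM socket 3, Channel E, Processor Socket 2. disabled due to other memory failed in same channel. [0xE542] was asserted"
'''

for entry in parse_sel_decode(sample):
    if entry.get("severity") == "Critical":
        loc = LOCATION_RE.search(entry.get("description", ""))
        print(entry["Id"], entry["dateTime"], loc.groups() if loc else None)
```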


For standalone servers:

  • tmp\tech_support.frupids
====== Dumping IPMI FRU Records ======
Product Name: UCSC-C220-xxx
Product Part Number: 74-xxxx-01
Product Version: A
 Product Serial: FCHxxxxxxxN – Server Serial Number

====== Dumping Inventory Catalog PIDs ======
DIMMList:
Name: DIMM_A1
Description: 8GB DDR3-1333-MHz RDIMM/PC3-10600/dual rank/1.35v
PID: UCS-MR-1X082RX-A – DIMM PID



Post-Analysis
After performing the analysis, ensure that the service request is updated with the correct serial number of the impacted server and that the database is searched for any previous RMAs associated with the server being investigated.  If the DIMM showing faults was replaced recently, the motherboard may be suspect.

Add your analysis to the service request.

Logical Troubleshooting
 
Once the errors are identified, attempt to clear them all, then monitor the counters and the Faults tab in UCSM to see whether they persist.
Log in to the server command line.

Clear memory error counters

server# scope chassis
server /chassis # reset-ecc

Clear the System Event Log (SEL) using the commands below:

Server# scope sel
Server /sel # clear 
This operation will clear the whole sel.
Continue?[y|N]y


 Reset the CIMC log using the commands below:

Server# scope cimc
Server /cimc # scope log
Server /cimc/log # clear


Monitor the environment for 48 hours.
If errors persist, capture a fresh set of UCS and Chassis logs, confirm analysis, formulate an action plan based on the evidence, and proceed to the next section.
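
One way to decide whether errors persisted is to compare per-DIMM error counts taken from the baseline logs with counts from the logs captured after the monitoring window. The sketch below assumes the counts have already been extracted into dictionaries mapping DIMM name to a (correctable, uncorrectable) pair. The function name and the "after" values are made up for illustration.

```python
def dimms_with_new_errors(before, after):
    """Return {dimm: (correctable_delta, uncorrectable_delta)} for DIMMs
    whose counts grew between the two log captures.

    Both arguments map a DIMM name to (correctable, uncorrectable) counts.
    A DIMM missing from `before` is treated as starting at (0, 0).
    """
    grew = {}
    for dimm, (corr, uncorr) in after.items():
        base_corr, base_uncorr = before.get(dimm, (0, 0))
        delta = (corr - base_corr, uncorr - base_uncorr)
        if delta[0] > 0 or delta[1] > 0:
            grew[dimm] = delta
    return grew

# Illustrative values only; the "after" numbers are hypothetical.
baseline = {"E1": (0, 0), "E2": (2560, 3)}
followup = {"E1": (0, 0), "E2": (2600, 3)}

for dimm, (d_corr, d_uncorr) in dimms_with_new_errors(baseline, followup).items():
    print(f"{dimm}: +{d_corr} correctable, +{d_uncorr} uncorrectable")
```

Comparing deltas rather than absolute values keeps the check meaningful even if a cumulative counter was not reset by the clearing steps.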

Cause

  • DIMM errors are usually caused by a faulty DIMM, or sometimes by a faulty motherboard.


Notes

  • None

Additional Information

Please refer to this video:

Affected Products

Converged Infrastructure, Converged Systems, VxBlock and Vblock Systems, VxBlock and vBlock Systems Series
Article Properties
Article Number: 000194450
Article Type: How To
Last Modified: 07 Jan 2025
Version:  4