Steps on how to confirm and troubleshoot DIMM errors on a Cisco C-Series Server
Summary: Steps on how to confirm and troubleshoot DIMM errors on a Cisco C-Series Server
Instructions
Facts
- Cisco C-Series Rack-Mount Servers (standalone or managed by UCSM)
Symptoms
- Alerts will show up in CIMC or UCSM, such as:
F0184
F0185
F0137
F1236
F1237
- PSOD (Purple Screen of Death) on the KVM or host console
Solution
Log collection
Capture logs from the affected server BEFORE any troubleshooting is done. We need a baseline to determine the success of troubleshooting steps.
C-Series Rack servers can either be standalone or managed by UCSM. The steps to gather and review the logs will be slightly different depending on which it is.
- Standalone.
- Managed by UCSM - Select “Rack Mount” instead of “chassis” or “ucsm” in the Options field
- If you have only CIMC logs, you can tell that they are from a UCSM-managed server because the file name will contain CIMCXXX. The log files will also be in a zipped directory called Server XX, instead of directly in the main zipped directory. If you see this, UCSM logs will be required as well.
If the server experienced a PSOD, take a screenshot of the PSOD and gather the vSphere/host logs as well.
Log analysis
The main differences between standalone and UCSM-managed logs are:
- Additional information is available in the UCSM sam_techsupport file for UCSM-managed servers.
- The location of the directories (see the note under Log collection).
Helpful log locations in UCSM and CIMC logs:
UCSM_X_TechSupport.tar\sam_techsupportinfo
- `show server inventory expand` (confirm server serial number, locate PID). Example:
Server 1:
Model: UCSC-C220-M4S
Acknowledged Serial (SN): FCHXXXXXXXXXX
Acknowledged Product Name: Cisco UCS C220 M4S
Acknowledged PID: UCSC-C220-M4S
- `show fault detail` (locate faults associated) - Example:
Severity: Major
Code: F0844
Last Transition Time: 2017-05-23T12:40:40.774
Description: DIMM DIMM_B2 on server 24 operaState: disabled
- `show server memory detail` (locate impacted DIMM PID). Example:
Location: DIMM_A1
Product Name: 16GB DDR4-2400-MHz RDIMM/PC4-19200/single rank/x4/1.2v
PID: UCS-MR-xxxxxxxx-A
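When many faults are present, it can help to pull the DIMM slot and server ID out of the `show fault detail` records programmatically. The following is a minimal sketch, assuming the records have already been split into key/value dictionaries and that DIMM fault descriptions follow the shape of the example above. The function name `dimm_faults` and the second sample record (code F9999) are hypothetical and exist only for illustration.

```python
import re

# Assumption: rack-server fault descriptions look like the example above
# ("DIMM DIMM_B2 on server 24 ..."); other layouts are not handled.
FAULT_DIMM_RE = re.compile(r"DIMM (DIMM_\w+) on server (\d+)")


def dimm_faults(records):
    """Return (code, server_id, dimm_slot) for each fault that names a DIMM."""
    found = []
    for record in records:
        match = FAULT_DIMM_RE.search(record.get("Description", ""))
        if match:
            found.append(
                (record.get("Code"), int(match.group(2)), match.group(1))
            )
    return found


faults = [
    {
        "Severity": "Major",
        "Code": "F0844",
        "Description": "DIMM DIMM_B2 on server 24 operaState: disabled",
    },
    # Hypothetical non-DIMM record, included only to show it is skipped.
    {"Severity": "Info", "Code": "F9999", "Description": "hypothetical fault"},
]
result = dimm_faults(faults)
print(result)
```

The extracted slot name can then be matched against the `Location` field of `show server memory detail` to find the PID of the impacted DIMM.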
Note – most of this information is available in sam_techsupport for UCSM-managed servers
[ServerXX_TechSupport.tar]\tmp\ServerXX_TechSupport.txt
- Chassis Info Area
Find the server serial number, listed as “Chassis Serial Num”. Example as follows:
====================[ Chassis Info Area ]======================
Chassis Part Num : [74-xxxxx-02]
Chassis Serial Num : [FCHXXXXXXXXX]
- Board Area
Find Motherboard PID and serial number. Example as follows:
========================[ Board Area ]=========================
Board Product Name : [UCSC-C240-Mxxxx]
Board Serial Number : [FCHXXXXXXXX]
- SMBIOS Table Dump BEGIN
Find the DIMM part number under Memory Device, using the Locator and Part Number fields. Example as follows:
Note: this may not be the Cisco PID, but can be correlated to find it
Memory Device
Locator: DIMM_A1
Part Number: 36ASxxxxxx-2G3B1
- Querying All IPMI Sensors
Correctable and uncorrectable errors:
Sensor Name | Reading | Unit | Status | LNR | LC | LNC | UNC | UC | UNR
DDR4_P2_E1_ECC | 63250.000 | error | UNR | na | na | na | na | na | 60250.000
DDR4_P2_E2_ECC | 63750.000 | error | UNR | na | na | na | na | na | 60250.000
DDR4_P2_E3_ECC | 63250.000 | error | UNR | na | na | na | na | na | 60250.000
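The sensor rows above are pipe-delimited, so the ECC sensors that have crossed their upper non-recoverable (UNR) threshold can be picked out with a short script. This is only a sketch: it assumes one sensor per line with the ten columns shown in the header, and the helper names `parse_sensor_rows` and `ecc_sensors_over_unr` are illustrative, not part of any Cisco tool.

```python
def parse_sensor_rows(text):
    """Parse pipe-delimited IPMI sensor rows into dictionaries."""
    columns = ["name", "reading", "unit", "status",
               "lnr", "lc", "lnc", "unc", "uc", "unr"]
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header and anything that is not a 10-column sensor row.
        if len(fields) != len(columns) or fields[0] == "Sensor Name":
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


def ecc_sensors_over_unr(rows):
    """Return names of *_ECC sensors whose reading is at or above UNR."""
    flagged = []
    for row in rows:
        if not row["name"].endswith("_ECC"):
            continue
        try:
            reading = float(row["reading"])
            unr = float(row["unr"])
        except ValueError:
            continue  # "na" or another non-numeric value
        if reading >= unr:
            flagged.append(row["name"])
    return flagged


sample = """\
Sensor Name | Reading | Unit | Status | LNR | LC | LNC | UNC | UC | UNR
DDR4_P2_E1_ECC | 63250.000 | error | UNR | na | na | na | na | na | 60250.000
DDR4_P2_E2_ECC | 63750.000 | error | UNR | na | na | na | na | na | 60250.000
DDR4_P2_E3_ECC | 63250.000 | error | UNR | na | na | na | na | na | 60250.000
"""
flagged = ecc_sensors_over_unr(parse_sensor_rows(sample))
print(flagged)
```

Comparing the reading against the UNR column, rather than trusting the Status column alone, is a design choice made for this sketch; either signal can be used to shortlist DIMMs for further review.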
[ServerXX_TechSupport.tar]\var\log\sel\log
- Review the logs for any Correctable and Uncorrectable ECC Errors:
Memory DDR4_P2_E2_ECC #0xb0 | read 512 correctable ECC errors on CPU2 DIMM E2 | Asserted
- Review the logs for any CATERR_N … Asserted | Asserted entries. Example as follows:
03/06/2017 20:02:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted
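The SEL review can be partly automated by totalling ECC errors per DIMM and collecting CATERR_N assertions. The sketch below assumes SEL lines shaped like the two examples above. The wording for uncorrectable entries and the de-asserted sample line are assumptions, since the source only shows the correctable and Predictive Failure forms. De-asserted CATERR_N entries are skipped because they are expected at boot time.

```python
import re
from collections import Counter

# Assumption: uncorrectable entries use the same wording as the
# correctable example, with "uncorrectable" in place of "correctable".
ECC_RE = re.compile(
    r"read (\d+) (correctable|uncorrectable) ECC errors on (CPU\d+ DIMM \w+)",
    re.IGNORECASE,
)


def summarize_sel(lines):
    """Return (ECC totals keyed by (dimm, kind), CATERR_N assertion lines)."""
    ecc_totals = Counter()
    caterr_events = []
    for line in lines:
        match = ECC_RE.search(line)
        if match:
            count, kind, dimm = match.groups()
            ecc_totals[(dimm, kind.lower())] += int(count)
        elif "CATERR_N" in line and "de-asserted" not in line.lower():
            caterr_events.append(line.strip())
    return ecc_totals, caterr_events


sel_lines = [
    "Memory DDR4_P2_E2_ECC #0xb0 | read 512 correctable ECC errors on CPU2 DIMM E2 | Asserted",
    "03/06/2017 20:02:12 | CIMC | Processor CATERR_N #0x70 | Predictive Failure asserted | Asserted",
    # Hypothetical boot-time entry, included to show that it is ignored.
    "03/06/2017 20:05:00 | CIMC | Processor CATERR_N #0x70 | Predictive Failure de-asserted | Asserted",
]
ecc_totals, caterr_events = summarize_sel(sel_lines)
print(dict(ecc_totals))
print(caterr_events)
```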
Note: it is expected behavior to see CATERR_N de-asserted | Asserted in the logs at boot time
[ServerXX_TechSupport.tar]\var\DIMM-BL_Status.txt
- Find the correctable and uncorrectable error counts for the impacted DIMM(s) and copy the relevant fields. Example as follows:
================== SUMMARY OF DIMM ERRORS ===================
------- DIMM E2 ----------
CURRENT SLOT ERROR COUNTS :
Correctable ECC Errors Since Last Server Boot : 0
Cummulative Correctable ECC Error Count : 2560
Uncorrectable ECC Errors Since Last Server Boot : 0
Cummulative Uncorrectable ECC Error Count : 3
PREVIOUS SLOT ERROR COUNTS :
Correctable ECC Error Count : 0
Uncorrectable ECC Error Count : 0
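To copy the relevant fields consistently, the DIMM error summary can be parsed into a dictionary per DIMM. The sketch below assumes the file lists one counter per line under a `------- DIMM <slot> ----------` header, with section lines such as `CURRENT SLOT ERROR COUNTS :` carrying no value. That layout is inferred from the example above, and `parse_dimm_summary` is an illustrative name.

```python
import re

DIMM_HEADER_RE = re.compile(r"^-+\s*DIMM\s+(\w+)\s*-+$")


def parse_dimm_summary(text):
    """Return {dimm: {"SECTION / counter name": count}}."""
    result = {}
    dimm = None
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = DIMM_HEADER_RE.match(line)
        if header:
            dimm = header.group(1)
            result[dimm] = {}
            section = None
            continue
        if dimm is None or ":" not in line:
            continue
        key, _, value = line.rpartition(":")
        key, value = key.strip(), value.strip()
        if not value:
            section = key  # e.g. "CURRENT SLOT ERROR COUNTS"
        elif value.isdigit():
            result[dimm][f"{section} / {key}"] = int(value)
    return result


summary_text = """\
================== SUMMARY OF DIMM ERRORS ===================
------- DIMM E2 ----------
CURRENT SLOT ERROR COUNTS :
Correctable ECC Errors Since Last Server Boot : 0
Cummulative Correctable ECC Error Count : 2560
Uncorrectable ECC Errors Since Last Server Boot : 0
Cummulative Uncorrectable ECC Error Count : 3
PREVIOUS SLOT ERROR COUNTS :
Correctable ECC Error Count : 0
Uncorrectable ECC Error Count : 0
"""
summary = parse_dimm_summary(summary_text)
for name, count in summary["E2"].items():
    print(name, "=", count)
```

Prefixing each counter with its section keeps the current-slot and previous-slot counts apart, which matters when a DIMM has been moved between slots.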
[ServerXX_TechSupport.tar]\var\sel_decode.txt
- Chronological record of SEL entries and faults. Example as follows:
eventLogMaxEntries: 1445
eventLogList:
---
Id: 1440
severity: Critical
dateTime: 2017-03-10 00:57:17
dateTimeOrder: 00005
description: "System Software event: Post sensor, DIMM socket 3, Channel E, Processor Socket 2. disabled due to other memory failed in same channel. [0xE542] was asserted"
For standalone servers:
- tmp\tech_support.frupids
====== Dumping IPMI FRU Records ======
Product Name: UCSC-C220-xxx
Product Part Number: 74-xxxx-01
Product Version: A
Product Serial: FCHxxxxxxxN (server serial number)
====== Dumping Inventory Catalog PIDs ======
DIMMList:
Name: DIMM_A1
Description: 8GB DDR3-1333-MHz RDIMM/PC3-10600/dual rank/1.35v
PID: UCS-MR-1X082RX-A (DIMM PID)
Post-Analysis
After performing the analysis, ensure that the service request is updated with the correct serial number of the impacted server, and search the database for any previous RMAs associated with the server being investigated. If the DIMM showing faults was replaced recently, the motherboard may be suspect.
Add your analysis to the service request.
Logical Troubleshooting
Once errors are identified, we will attempt to clear them all and monitor counters and the faults tab in UCSM to see if they persist.
Log in to the server command line.
Clear memory error counters
server# scope chassis
server /chassis # reset-ecc
Clear the System Event Log using the commands below:
Server# scope sel
Server /sel # clear
This operation will clear the whole sel.
Continue?[y|N]y
Reset the CIMC log using the commands below:
Server# scope cimc
Server /cimc # scope log
Server /cimc/log # clear
Monitor the environment for 48 hours.
If errors persist, capture a fresh set of UCS and Chassis logs, confirm analysis, formulate an action plan based on the evidence, and proceed to the next section.
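One way to decide whether errors persisted over the monitoring window is to compare the per-DIMM counters from the log set taken right after clearing with those from the fresh log set. The sketch below assumes both snapshots have already been reduced to dictionaries of counts per DIMM. The counter names and values are hypothetical, as is the function name `counters_that_increased`.

```python
def counters_that_increased(before, after):
    """Return {dimm: {counter: delta}} for counters that rose between snapshots."""
    increases = {}
    for dimm, counters in after.items():
        for name, value in counters.items():
            # A DIMM or counter missing from the first snapshot counts as 0.
            delta = value - before.get(dimm, {}).get(name, 0)
            if delta > 0:
                increases.setdefault(dimm, {})[name] = delta
    return increases


# Hypothetical snapshots: right after clearing, and 48 hours later.
after_clear = {
    "E2": {"correctable": 0, "uncorrectable": 0},
    "A1": {"correctable": 0, "uncorrectable": 0},
}
after_48h = {
    "E2": {"correctable": 128, "uncorrectable": 0},
    "A1": {"correctable": 0, "uncorrectable": 0},
}
persisting = counters_that_increased(after_clear, after_48h)
print(persisting)
```

An empty result suggests the counters stayed flat during the window; any DIMM that appears in the result is a candidate for the action plan.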
Cause
- DIMM errors are usually caused by a faulty DIMM, or sometimes by a bad Motherboard
Notes
- None