Data Domain: Troubleshooting Memory Errors
Summary: This KB article describes how to troubleshoot memory alerts including how to identify a faulty DIMM that needs to be replaced.
Symptoms
Possible Symptoms / Alerts:
- DIMM-00001: Correctable ECC logging limit reached
- DIMM-00002: Multibit Uncorrectable ECC error
- DIMM-00003: A memory card has failed
- ENVIRONMENT-00009: Memory correctable ECC errors exceed warning threshold
- ENVIRONMENT-00013: Memory uncorrectable ECC error alert.
- ENVIRONMENT-00044: Memory riser fault has been detected
- MEM-00001: DIMM failure detected after install. DDFS will not be started.
- MEM-00002: Memory size(nnnnnnnnKB) goes below the configured size(nnnnnnnnKB)
Note: These alert IDs may also be reported without the hyphen (-), for example:
DIMM00001, DIMM00002, DIMM00003, ENVIRONMENT00009, ENVIRONMENT00013, ENVIRONMENT00044, MEM00001, MEM00002
- IPMI Watchdog reboot
- Less memory than expected alert
- Possible hang in power-on self-test (POST)
- System will not boot or system crash
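Because the alert IDs above can appear with or without the hyphen, a search for them should accept both forms. The following is a minimal sketch only: the alert IDs come from the list above, but the function name, the regular expression, and the sample text are illustrative assumptions and are not part of DDOS or any Dell tool.

```python
import re

# Alert IDs from the symptom list above, in hyphenated form.
KNOWN_MEMORY_ALERTS = {
    "DIMM-00001", "DIMM-00002", "DIMM-00003",
    "ENVIRONMENT-00009", "ENVIRONMENT-00013", "ENVIRONMENT-00044",
    "MEM-00001", "MEM-00002",
}

# Hypothetical pattern: prefix, optional hyphen, five digits.
_ALERT_RE = re.compile(r"\b(DIMM|ENVIRONMENT|MEM)-?(\d{5})\b")


def find_memory_alerts(text):
    """Return known memory alert IDs found in text, normalized to the
    hyphenated form, in order of first appearance and without repeats."""
    found = []
    for prefix, number in _ALERT_RE.findall(text):
        alert_id = f"{prefix}-{number}"
        if alert_id in KNOWN_MEMORY_ALERTS and alert_id not in found:
            found.append(alert_id)
    return found


if __name__ == "__main__":
    # Made-up sample text; real alert output may be formatted differently.
    sample = "alert DIMM00002 raised; later ENVIRONMENT-00013 and DIMM-00002"
    print(find_memory_alerts(sample))
```

Normalizing to the hyphenated form means that both spellings of the same alert are counted once.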
Cause
Uncorrectable memory errors can cause a system reboot and are considered hard memory faults.
Total failure of any single DIMM or Memory Riser (houses multiple DIMMs) may result in a System Down event and prevent the Filesystem from being enabled. This is because the DD Filesystem (DDFS) process fills most of the physical memory.
NOTE: Memory errors can be masked by other symptoms or alerts, for example a CPU Machine Check Error. Deeper log analysis and troubleshooting may be required.
Resolution
DIMM Error reporting is tracked on all DDOS versions. However, analysis of system logs may be required to identify the specific DIMM at fault.
Note: A DIMM may reside within a Memory Riser attached to the motherboard.
For the Data Domain Filesystem (DDFS) to be enabled, all installed memory must be present and functional.
Troubleshooting may include:
- Offline Diagnostics
- Log file analysis
- Reseating suspect DIMMs
- Moving suspect DIMMs to 'known-good' slots (i.e. does the fault follow the DIMM, slot, channel or controller?)
- See 'Dell Swap Testing guide' Data Domain: Memory Card or DIMM With Failed or Faulty Error
- Replacement of Failed DIMM or Riser (as appropriate)
- Ongoing monitoring to confirm system stability after HW changes or replacement
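The swap-testing step in the list above asks whether the fault follows the DIMM or stays with the slot. The sketch below shows one way to record that reasoning for a single swap. It is an illustration under assumptions: the function name, the slot labels, and the wording of the conclusions are hypothetical, and it does not replace the Dell swap testing guide or log analysis.

```python
def interpret_swap_test(original_slot, new_slot, error_slot_after_swap):
    """Interpret one swap test.

    original_slot: slot the suspect DIMM was in when the error was reported
    new_slot: known-good slot the suspect DIMM was moved to
    error_slot_after_swap: slot reporting errors after the swap, or None
        if no error has been seen since the swap
    """
    if error_slot_after_swap is None:
        # Reseating alone may have cleared the fault; keep monitoring.
        return "no error reproduced: continue monitoring"
    if error_slot_after_swap == new_slot:
        # The error moved with the DIMM.
        return "fault follows the DIMM: DIMM is the likely cause"
    if error_slot_after_swap == original_slot:
        # The error stayed behind with a different DIMM installed.
        return "fault stays with the slot: suspect slot, channel, controller or riser"
    return "error reported in an unrelated slot: inconclusive, analyze logs"


if __name__ == "__main__":
    # Hypothetical slot labels; see the hardware guide for the real layout.
    print(interpret_swap_test("A1", "B1", "B1"))
```

A single swap only narrows the fault down; the ongoing monitoring step above is still needed to confirm stability.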
Note: If the problem persists after completing the steps in this KB article, contact your support provider or create a service request.
Additional Information
References:
- See Relevant DD system Hardware Guides for DIMM config/layout
- Data Domain: System Memory Requirements and Expanded Storage Configurations
- Data Domain: Memory Card | DIMM Failed,error or faulty
- Data Domain: Memory Card or DIMM With Failed or Faulty Error