Data Domain: Memory Card or DIMM With Failed or Faulty Error
Summary: This document helps identify the error or fault and provides a resolution path.
Symptoms
Applies to:
- All Data Domain systems
- All software versions of Data Domain Operating System (DDOS)
DIMM-00001: Correctable ECC logging limit reached
DIMM-00002: Multibit Uncorrectable ECC error
DIMM-00003: A memory card has failed
ENVIRONMENT-00009: Memory correctable ECC errors exceed warning threshold
ENVIRONMENT-00013: Memory uncorrectable ECC error alert.
ENVIRONMENT-00044: Memory riser fault has been detected
MEM-00001: DIMM failure detected after install. DDFS will not be started.
MEM-00002: Memory size(nnnnnnnnKB) goes below the configured size(nnnnnnnnKB)
Cause
The DIMMs installed in Data Domain systems use Error Correcting Code (ECC), which allows correctable memory errors to be fixed on the fly. If an error threshold is exceeded, DDOS identifies the fault and generates an appropriate alert on the system.
Uncorrectable memory errors may cause a system reboot and are considered hard memory faults. Total failure of any single DIMM or memory riser may result in a System Down event and prevent the file system from being enabled, because the Data Domain File System (DDFS) process fills most of the physical memory.
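As an illustration of the distinction above, the following minimal Python sketch scans a block of text (for example, saved alert output) for the alert IDs listed under Symptoms and separates correctable ECC alerts from everything else. The function names, the `CORRECTABLE` grouping, and the "hard" label are assumptions made for this example; they are not an official DDOS severity mapping, and the input format is not assumed to match any particular command output.

```python
import re

# Alert IDs and descriptions copied from the Symptoms list in this article.
MEMORY_ALERTS = {
    "DIMM-00001": "Correctable ECC logging limit reached",
    "DIMM-00002": "Multibit Uncorrectable ECC error",
    "DIMM-00003": "A memory card has failed",
    "ENVIRONMENT-00009": "Memory correctable ECC errors exceed warning threshold",
    "ENVIRONMENT-00013": "Memory uncorrectable ECC error alert",
    "ENVIRONMENT-00044": "Memory riser fault has been detected",
    "MEM-00001": "DIMM failure detected after install",
    "MEM-00002": "Memory size goes below the configured size",
}

# Assumption: a triage grouping for this sketch only, based on the alert text.
CORRECTABLE = {"DIMM-00001", "ENVIRONMENT-00009"}

ALERT_ID = re.compile(r"\b(?:DIMM|ENVIRONMENT|MEM)-\d{5}\b")


def find_memory_alerts(text):
    """Return known memory alert IDs found in text, in order of first appearance."""
    seen = []
    for match in ALERT_ID.finditer(text):
        alert_id = match.group(0)
        # Ignore IDs that match the pattern but are not memory alerts.
        if alert_id in MEMORY_ALERTS and alert_id not in seen:
            seen.append(alert_id)
    return seen


def triage(alert_ids):
    """Return 'none', 'correctable', or 'hard' for a list of alert IDs."""
    if not alert_ids:
        return "none"
    if all(alert_id in CORRECTABLE for alert_id in alert_ids):
        return "correctable"
    return "hard"
```

A result of "hard" here only means that at least one alert outside the correctable group was found; the affected component still has to be identified as described under Resolution.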
Resolution
NOTE: If a DIMM error is reported on a Dell PowerEdge based system, the first recovery action is to reboot the Data Domain unit. The reboot initiates Post Package Repair (PPR) to recover the DIMM.
Determine the cause of the alert, identify the affected component (DIMMs, CPU, or motherboard), and replace parts as needed.
If possible, gather a Support Bundle and create a Service Request with your contracted Service Provider. The following video shows how to gather a Support Bundle: Gather a Support Bundle
Resolution Guidelines:
- For Dell PowerEdge based systems, initiate a system reboot so that automatic Post Package Repair (PPR) can recover the DIMM.
- Improvements in BIOS firmware allow PPR to recover from both correctable and uncorrectable DIMM errors (Reference)
- Compare the current system state with an Auto-Support from before the DIMM failure or alert.
- Useful DD-CLI (SSH) commands for checking memory:
# alerts show current
# system show meminfo
# enclosure show memory
# log view debug/messages.engineering ('q' to quit)
- Use DDOS Offline Diagnostics to test the memory and isolate the fault. Go to Dell Support to access the Dell EMC Data Domain Operating System 6.x Offline Diagnostics Suite User Guide.
- If possible, perform physical troubleshooting to isolate the faulty component by elimination (using documented replacement guides and procedures).
- Reseat the DIMM and ensure that both sides are latched properly.
- Swap it with a known good DIMM from another slot, channel, bank, or controller.
- If a system is down (no boot) due to a suspected memory/DIMM fault, try a minimal boot option (remove peripheral devices or cards, and leave 1x DIMM in slot '0').
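The non-interactive DD-CLI commands listed above can be captured in one pass so that their output can be kept for later comparison. The following is a minimal Python sketch under stated assumptions: the host name and user in the usage note are hypothetical, it assumes the system accepts a DD-CLI command passed as an SSH remote command, and it assumes authentication (key or password prompt) is handled by the local `ssh` client. `log view debug/messages.engineering` is left out because it opens an interactive pager.

```python
import subprocess

# Commands copied from the list above; the interactive `log view` is omitted.
DD_MEMORY_COMMANDS = [
    "alerts show current",
    "system show meminfo",
    "enclosure show memory",
]


def build_ssh_command(host, user, dd_command):
    """Build the argument list for running one DD-CLI command over SSH."""
    return ["ssh", f"{user}@{host}", dd_command]


def collect_memory_state(host, user, timeout=60):
    """Run each command over SSH and return a dict of {command: stdout}."""
    results = {}
    for dd_command in DD_MEMORY_COMMANDS:
        completed = subprocess.run(
            build_ssh_command(host, user, dd_command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        results[dd_command] = completed.stdout
    return results
```

For example, `collect_memory_state("dd01.example.com", "sysadmin")` (hypothetical host and user) would return the three outputs keyed by command; saving them to a file gives a point-in-time record to set against an earlier Auto-Support.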
Additional Information
- See knowledge article 130388: PowerProtect and Data Domain Hardware Documents for relevant information about DIMM configuration and layout.
- See related knowledge article 82030: Data Domain: System Memory Requirements and Expanded Storage Configurations