Data Domain: Troubleshooting Memory Errors
Summary: This article describes how to troubleshoot memory-related alerts on Dell Data Domain systems, including how to identify a faulty DIMM that requires replacement. It covers common alert codes, root causes of correctable and uncorrectable ECC errors, and step-by-step resolution guidance such as initiating Post Package Repair (PPR), running diagnostic CLI commands, and performing physical troubleshooting. ...
Symptoms
- One or more of the following alerts appear on the Data Domain system:
  DIMM-00001: Correctable ECC logging limit reached
  DIMM-00002: Multibit Uncorrectable ECC error
  DIMM-00003: A memory card has failed
  ENVIRONMENT-00009: Memory correctable ECC errors exceed warning threshold
  ENVIRONMENT-00013: Memory uncorrectable ECC error alert
  ENVIRONMENT-00044: Memory riser fault has been detected
  MEM-00001: DIMM failure detected after install. DDFS will not be started.
  MEM-00002: Memory size (nnnnnnnnKB) goes below the configured size (nnnnnnnnKB)
  Note: These alerts may also appear without the hyphen (for example, DIMM00001, ENVIRONMENT00009, MEM00001).
- IPMI Watchdog reboot occurs.
- The system reports less memory than expected.
- The system hangs during power-on self-test (POST).
- The system fails to boot or crashes unexpectedly.
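Because the alert IDs can appear with or without the hyphen, a text search for only one form can miss alerts. The following is a minimal Python sketch, not a Dell-provided tool, that scans arbitrary text (for example, saved alert output) for the IDs listed above and reports them in hyphenated form. The function name `find_memory_alerts` and the five-digit pattern are assumptions made for illustration.

```python
import re

# Alert IDs taken from the symptom list in this article.
KNOWN_MEMORY_ALERTS = {
    "DIMM-00001", "DIMM-00002", "DIMM-00003",
    "ENVIRONMENT-00009", "ENVIRONMENT-00013", "ENVIRONMENT-00044",
    "MEM-00001", "MEM-00002",
}

# Matches an ID with or without the hyphen, e.g. DIMM-00001 or DIMM00001.
# The five-digit width is an assumption based on the IDs listed above.
ALERT_ID = re.compile(r"\b(DIMM|ENVIRONMENT|MEM)-?(\d{5})\b")


def find_memory_alerts(text):
    """Return the sorted known memory alert IDs found in text, hyphenated."""
    found = set()
    for family, number in ALERT_ID.findall(text):
        alert_id = f"{family}-{number}"
        if alert_id in KNOWN_MEMORY_ALERTS:
            found.add(alert_id)
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "DIMM00001: Correctable ECC logging limit reached\n"
        "ENVIRONMENT-00013: Memory uncorrectable ECC error alert\n"
    )
    print(find_memory_alerts(sample))
```

Normalizing to one form before comparing makes it easier to line up the current alerts with those in an earlier report.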
Cause
Data Domain DIMMs use Error Correcting Code (ECC), which allows the system to correct single-bit memory errors automatically during operation. When the system cannot repair a memory error automatically, a reboot may be required to initiate Post Package Repair (PPR).
Uncorrectable (multi-bit) ECC errors constitute hard memory faults and can force an automatic system reboot. A total failure of any single DIMM or memory riser (which houses multiple DIMMs) can result in a system-down event and prevent the DD Filesystem (DDFS) from starting, because the DDFS process requires most of the available physical memory.
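The difference between a correctable and an uncorrectable error can be illustrated with a toy single-error-correct, double-error-detect (SECDED) code. The sketch below uses an extended Hamming code over 4 data bits purely as an illustration. It is an assumption for teaching purposes and not the code used by Data Domain memory controllers, which protect much wider words. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indexes 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity bit
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of set positions is 0 when valid
    overall = sum(code) % 2           # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    single = list(word)
    single[5] ^= 1                    # one flipped bit
    double = list(single)
    double[6] ^= 1                    # a second flipped bit
    for label, w in (("clean", word), ("1 flip", single), ("2 flips", double)):
        print(label, decode(w))
```

With one flipped bit the decoder recovers the original data, which corresponds to a correctable error. With two flipped bits it can only report the fault, which corresponds to the uncorrectable (multi-bit) case described above.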
Resolution
Follow the steps below to diagnose and resolve memory errors on a Data Domain system.
- Reboot the Data Domain system to initiate Post Package Repair (PPR).
  - On modern Data Domain systems (Dell PowerEdge–based), reboot the appliance as the first recovery action. The BIOS firmware initiates PPR during POST to recover DIMMs affected by correctable and uncorrectable ECC errors.
  - Refer to the applicable BIOS firmware documentation for PPR capabilities and requirements (Reference).
- Compare the current system state with a previous Auto-Support.
  - Review an Auto-Support report generated before the DIMM failure or alert occurred and compare it with the current state to identify changes.
- Run diagnostic CLI commands via SSH.
  - Use the following DD-CLI commands to inspect memory status:
    # alerts show current
    # system show meminfo
    # enclosure show memory
    # log view debug/messages.engineering ('q' to quit)
- Perform physical troubleshooting (if accessible).
  - Reseat the DIMM: ensure both side latches engage fully.
  - Swap the suspected DIMM with a known-good DIMM from another slot, channel, bank, or controller to isolate the fault.
  - Attempt a minimal boot if the system does not boot because of a suspected DIMM or memory fault: remove peripheral devices or add-in cards and leave only one DIMM installed in slot 0.
- Identify the affected component and replace as needed.
  - Determine whether the root cause is a faulty DIMM, CPU, or motherboard, and replace the identified component.
- Collect a Support Bundle and open a Service Request.
  - If the issue persists, gather a Support Bundle and create a Service Request with your contracted service provider. The following video shows how to gather a Support Bundle: Gather a Support Bundle
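The non-interactive diagnostic commands from the SSH step above could be collected in one pass so the output can be saved and compared with an earlier Auto-Support. The following is a minimal Python sketch under several assumptions: the system accepts a single command passed on the `ssh` command line, key-based or agent authentication is already set up, and the host name and user shown are placeholders. The helper names `build_ssh_command` and `collect_diagnostics` are hypothetical. The `log view` command is left out because it pages interactively.

```python
import subprocess

# Commands taken from the diagnostic step in this article. 'log view ...'
# is omitted because it opens an interactive pager ('q' to quit).
DIAGNOSTIC_COMMANDS = [
    "alerts show current",
    "system show meminfo",
    "enclosure show memory",
]


def build_ssh_command(host, user, dd_command):
    """Build the argument list for running one DD-CLI command over SSH."""
    return ["ssh", f"{user}@{host}", dd_command]


def collect_diagnostics(host, user, runner=subprocess.run):
    """Run each diagnostic command and return {command: captured stdout}."""
    results = {}
    for dd_command in DIAGNOSTIC_COMMANDS:
        completed = runner(
            build_ssh_command(host, user, dd_command),
            capture_output=True,
            text=True,
            timeout=120,
            check=False,  # keep going even if one command fails
        )
        results[dd_command] = completed.stdout
    return results


if __name__ == "__main__":
    # Placeholder host and user; replace with values for your environment.
    output = collect_diagnostics("dd01.example.com", "sysadmin")
    for command, text in output.items():
        print(f"===== {command} =====")
        print(text)
```

The `runner` parameter exists only so the function can be exercised without a live system. In normal use it defaults to `subprocess.run`.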
Additional Information
References:
- See the relevant DD system hardware guides for DIMM configuration and layout
- Data Domain: System Memory Requirements and Expanded Storage Configurations
- Data Domain: Memory Card | DIMM Failed,error or faulty
- Data Domain: Memory Card or DIMM With Failed or Faulty Error