PowerEdge: DDR4 Self-healing on Dell PowerEdge Servers with AMD Rome and Milan processors

Summary: An explanation of correctable memory errors on AMD PowerEdge Servers with DDR4 memory and changes to the troubleshooting steps


Symptoms

What is DDR4 "self-healing" on AMD Rome and Milan processor based PowerEdge Servers (R65xx, R75xx, and C65xx)?

Do the previous generation AMD based PowerEdge servers with AMD EPYC processors (R64xx and R74xx) support these same "self-healing" capabilities?

How do these DDR4 "self-healing" capabilities (BIOS enhancements) change recommended customer and Technical Support actions when encountering memory errors on a server?

Cause

Dell Technologies continues to enhance the PowerEdge BIOS to improve memory error event messaging, error handling, and "self-healing" upon a server reboot. These enhancements can prevent the need for a scheduled maintenance window and onsite presence to replace a DDR4 DIMM that was logging error events.

Resolution

There are two main memory-related "self-healing" BIOS enhancements that are included with AMD processor based PowerEdge Servers (65xx and 75xx) with DDR4 memory available at product launch. These enhancements do change the recommended steps and actions to take if memory errors occur and are logged to the Lifecycle log.
 

Note: The "self-healing" enhancements discussed in this article do not apply to the previous generation of AMD based PowerEdge servers with AMD EPYC processors. The 64xx and 74xx AMD PowerEdge Servers do not contain any of the "self-healing" enhancements described in this article. On those servers, memory retraining only occurs when changes in server memory configuration are detected. Version 1.0 of the Engineering white paper describes some of the RAS features available for AMD EPYC processors: PowerEdge YX4X Server Memory RAS Whitepaper v1.0 (dell.com)

 

Note: Current memory troubleshooting steps incorporate moving failing DIMMs to a different slot to confirm whether or not the errors follow the DIMM or remain with the DIMM slot.

With AMD Rome and Milan based PowerEdge servers, the first recommended step is a reboot or restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.

We always encourage customers to update to the latest available BIOS release (and iDRAC firmware) so that they can take advantage of the latest self-healing enhancements.


1. Memory retraining enhancements - Memory retraining, which happens during boot, optimizes the signal timing or margining for each DIMM and slot for best access. Timing characteristics of a DIMM may change for several different reasons:

  • Changes in Server memory configuration
  • BIOS changes
  • Different operating temperatures of the Server or DIMM
  • The general age of the DIMM

Current AMD Rome and Milan based PowerEdge servers (65xx and 75xx) perform Memory retraining upon every boot. This differs from the current Intel based PowerEdge server implementation.

If any of the following errors are logged in the SEL or Lifecycle logs, Dell Technologies Engineering recommends rebooting the server to allow memory retraining to occur.

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."
Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location(s) DIMM_XX."

With any of these correctable or uncorrectable (multibit) memory errors, the resulting memory retraining on reboot or restart may "self-heal" the failing DIMM by optimizing the signal timing and margining for each DIMM and slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
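
The triage rule described above can be sketched in Python. This is only an illustration, not a Dell tool: the helper name `retraining_action` and the idea of passing it plain lists of message IDs are assumptions made for the example, and the sketch ignores DIMM locations for brevity. The message IDs themselves are the ones listed in this section.

```python
# Message IDs for which this article recommends a reboot so that
# memory retraining can run.
RETRAIN_ON_REBOOT_IDS = {"MEM0701", "MEM0702", "MEM0005", "MEM0001"}
# Message ID logged when memory retraining fails during boot.
RETRAIN_FAILED_ID = "UEFI0106"


def retraining_action(ids_before_reboot, ids_after_reboot=None):
    """Suggest a next step (hypothetical helper, not a Dell API).

    ids_before_reboot: message IDs logged before the reboot.
    ids_after_reboot: message IDs logged after the reboot, or None
    if the server has not been rebooted yet.
    """
    flagged = RETRAIN_ON_REBOOT_IDS & set(ids_before_reboot)
    if not flagged:
        return "none"
    if ids_after_reboot is None:
        # First recommended step: reboot without moving DIMMs.
        return "reboot"
    after = set(ids_after_reboot)
    # Retraining failed, or the same errors came back after the reboot.
    if RETRAIN_FAILED_ID in after or flagged & after:
        return "consider DIMM replacement"
    return "monitor"
```

For example, `retraining_action(["MEM0701"])` suggests a reboot, while the same call with `["UEFI0106"]` as the post-reboot IDs suggests considering a DIMM replacement.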
 

2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location or address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.
 

Previously, this functionality was limited to the manufacturing process. As with the memory retraining enhancements mentioned earlier, there are certain correctable and uncorrectable memory errors that result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has been run. Examples of the errors are:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."
Critical - MEM9072 - "The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location arg1."

Any of these errors being logged in the SEL/Lifecycle log results in PPR being scheduled for the next reboot (warm or cold).

Note: A MEM8000 message (Correctable memory error logging disabled for a memory device at location DIMM_XX.) without a corresponding MEM0005, MEM0701, or MEM0702 on the same DIMM location does not result in PPR being scheduled for the next reboot.

After the reboot, verify that the PPR operation was successfully performed.
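
The scheduling rule in this section can be illustrated with a short Python sketch. It assumes log events are available as `(message_id, dimm_location)` pairs; that representation and the function name `locations_with_ppr_scheduled` are hypothetical. The trigger IDs are the examples listed above, so the set may not be exhaustive.

```python
# Example message IDs that, per this section, result in PPR being
# scheduled for the next reboot. The article calls these "examples",
# so this set may be incomplete.
PPR_TRIGGER_IDS = {"MEM0701", "MEM0702", "MEM0005", "MEM9072"}


def locations_with_ppr_scheduled(events):
    """Return the DIMM locations with PPR scheduled (hypothetical helper).

    events: iterable of (message_id, dimm_location) pairs.
    MEM8000 is deliberately not a trigger: on its own it does not
    schedule PPR under the behavior described in this section.
    """
    return {loc for msg_id, loc in events if msg_id in PPR_TRIGGER_IDS}


events = [
    ("MEM8000", "DIMM_A1"),  # MEM8000 alone: no PPR scheduled
    ("MEM0005", "DIMM_B2"),  # trigger: PPR scheduled for DIMM_B2
    ("MEM8000", "DIMM_B2"),
]
scheduled = locations_with_ppr_scheduled(events)
```

Any location in `scheduled` should keep its DIMM in place until the PPR operation has run.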

An example of a successful PPR operation is similar to:

  • Message ID MEM9060 - "The PostPackage Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."

A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails after the reboot. An example of a failing PPR message is:

  • Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
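
The post-reboot check can be sketched as follows. This is a minimal illustration, assuming the message IDs logged after the reboot have already been collected into a list; the function name `ppr_outcome` and the returned strings are hypothetical. Treating a failure message as taking precedence over a success message is also an assumption of the sketch.

```python
PPR_SUCCESS_ID = "MEM9060"   # PPR completed successfully
PPR_FAILURE_ID = "UEFI0278"  # PPR could not be completed


def ppr_outcome(ids_after_reboot):
    """Classify the PPR result from post-reboot message IDs.

    Hypothetical helper, not a Dell API. A failure message is checked
    first so that a failed repair is never reported as a success.
    """
    ids = set(ids_after_reboot)
    if PPR_FAILURE_ID in ids:
        return "failed: replace DIMM"
    if PPR_SUCCESS_ID in ids:
        return "success: no replacement needed"
    return "not confirmed"
```

A result of "not confirmed" means neither message was found, so the logs should be reviewed before drawing a conclusion.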


Updated April 24, 2020

Dell Technologies is continuing to enhance and expand our "self-healing" capabilities. The following section documents the updates/enhancements and what BIOS version the changes were implemented in.

BIOS 1.0.x - Initial article publication of the "self-healing" capabilities available starting with BIOS 1.0.x and higher, including example error messages and recommended actions.

BIOS 1.1.x and newer changes (December 2019)

  • MEM0702 (Correctable error rate exceeded […]) - Message updated from a critical to a warning event, and the recommended action updated to reboot the server to allow "self-healing" (Post Package Repair (PPR)) to occur.
    • Requires December 2019 or newer iDRAC to also be installed to get the updated message
    • Recommended Action: Reboot the server to allow PPR to run
  • MEM9060 - Message description updated to indicate "self-healing" was successfully completed

BIOS 1.2.x and newer changes (February 2020)

  • A "Correctable Error Logging" BIOS option was added to allow customers to disable all Lifecycle and SEL logging related to correctable errors. All the "self-healing" (PPR) features still function and memory retraining is still scheduled and run during the next reboot.
  • Addition of MEM08xx errors for RDIMMs and LRDIMMs replacing existing error messages and actions. Existing error messages are still used for platforms that do not support the "self-healing" capabilities.
    • Requires February 2020 or newer iDRAC for messages to get logged 

 

Note: Without updated iDRAC, new BIOS messages are "unknown" in the SEL and LC logs.

 

  • MEM0802 - Replaced MEM0702  - correctable error rate exceeded
    • Recommended Action: Reboot the server to allow PPR to run
  • MEM0804 - Replaced MEM9060 indicating PPR was successful. Now includes DIMM slot locations that ran PPR
    • Recommended Action: None, it indicates "self-healing" occurred, no DIMM replacement is needed.
  • MEM0805 - Replaced UEFI0278 indicating PPR failed
    • Recommended Action: Replace the failing DIMM
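
The message replacements listed above can be captured in a small lookup, which may help when a log contains a mix of older and newer message IDs. This Python sketch is illustrative only: the dictionary and function names are hypothetical, and the mappings and actions are taken from the list above.

```python
# Older message ID -> MEM08xx replacement (BIOS 1.2.x and newer).
REPLACED_BY = {
    "MEM0702": "MEM0802",
    "MEM9060": "MEM0804",
    "UEFI0278": "MEM0805",
}

# Recommended action for each MEM08xx message, per the list above.
RECOMMENDED_ACTION = {
    "MEM0802": "Reboot the server to allow PPR to run",
    "MEM0804": "None; self-healing occurred, no DIMM replacement is needed",
    "MEM0805": "Replace the failing DIMM",
}


def action_for(message_id):
    """Return the recommended action, or None if the ID is not covered.

    Hypothetical helper: older IDs are first translated to their
    MEM08xx replacement, then looked up.
    """
    current = REPLACED_BY.get(message_id, message_id)
    return RECOMMENDED_ACTION.get(current)
```

With this lookup, `action_for("UEFI0278")` and `action_for("MEM0805")` return the same recommendation.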

Updated January 25, 2021

BIOS 1.7.x and newer changes (December 2020)
 

  • MEM8000 (Correctable error logging disabled) - In an earlier BIOS release, Dell Technologies Engineering made a change to enhance detection of the rate of correctable errors that may impact performance. This change resulted in an uptick in MEM8000 events that was not substantiated by results from memory component failure analysis. Starting with BIOS 1.7.x, there are two changes related to MEM8000. First, the signaling of the MEM8000 event has been modified. Second, BIOS schedules self-healing (PPR) for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
    • Recommended Action: Reboot the server to allow self-healing/PPR to run. Confirm that PPR was successful (MEM0804).
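
The version dependence of MEM8000 handling can be expressed as a simple check. This sketch assumes the BIOS version is available as a dotted numeric string such as "1.7.3"; the function names are hypothetical, and the threshold reflects only what this update states.

```python
def parse_version(text):
    """Parse a dotted numeric version string, e.g. "1.7.3" -> (1, 7, 3).

    Assumes every component is an integer; other formats raise ValueError.
    """
    return tuple(int(part) for part in text.split("."))


def mem8000_schedules_ppr(bios_version):
    """Return True if a MEM8000 event schedules PPR on this BIOS.

    Hypothetical helper: per this update, BIOS 1.7.x and newer
    schedule self-healing (PPR) for the next reboot after MEM8000.
    """
    return parse_version(bios_version)[:2] >= (1, 7)
```

Comparing integer tuples rather than strings means a version such as "1.10.0" is correctly treated as newer than "1.7.0".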



There are additional RAS feature enhancements being evaluated for inclusion in future BIOS updates.

A white paper is planned that describes Dell Technologies PowerEdge server (AMD Rome and Milan based processors) Memory-related Reliability, Availability, and Serviceability (RAS) features.

This article is updated as new information becomes available.

Affected Products

OEMR R6515, OEMR R6525, OEMR R7515, OEMR R7525, PowerEdge R6515, PowerEdge R6525, PowerEdge R7515, PowerEdge R7525, PowerFlex appliance R6525, PowerFlex custom node R6525, Dell EMC vSAN R6515 Ready Node, Dell EMC vSAN R7515 Ready Node, PowerFlex appliance R7525 ...
Article Properties
Article Number: 000062034
Article Type: Solution
Last Modified: 13 Aug 2025
Version:  11