PowerEdge: What is DDR4 Self-healing with Intel Xeon Scalable Processors

Summary: Correctable and uncorrectable memory errors on PowerEdge servers with DDR4 memory, and the resulting changes to troubleshooting steps


Symptoms

What is DDR4 "self-healing" on Dell PowerEdge Servers with Intel Xeon Scalable Processors (first or second generation) with BIOS version 2.1.x or above?

How do these DDR4 "self-healing" capabilities (BIOS enhancements) change recommended customer and Technical Support actions when encountering memory errors on a server?

What are the "self-healing" enhancements in the newer BIOS versions?

Cause

Dell continues to enhance the PowerEdge BIOS to improve memory event messaging, error handling, and "self-healing" that occurs upon a server reboot. These enhancements can avoid the need for a scheduled maintenance window or onsite presence to replace a DDR4 DIMM that was logging error events.

Resolution

Two main memory-related "self-healing" BIOS enhancements were implemented for PowerEdge servers with DDR4 running BIOS version 2.1.x and later. These enhancements change the recommended steps to take when memory events occur and are logged to the Lifecycle log.

Note:
  • If encountering memory errors with DDR4 on BIOS 2.0 or earlier, update the BIOS to the latest revision, which includes many memory self-healing capabilities and ongoing enhancements. We always encourage customers to update to the latest available BIOS release (and iDRAC firmware) so that they can take advantage of the latest self-healing enhancements.
  • Previous memory troubleshooting steps included moving failing DIMMs to a different slot to confirm whether or not the errors follow the DIMM or remain with the DIMM slot. With BIOS 2.1.x or later, the first recommended step is to restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without scheduling any DIMM replacements.
  1. Memory retraining enhancements

Memory retraining, which happens during boot (early in the Configuring Memory steps), optimizes the signal timing and margining for each DIMM/slot for best access. Memory signal timing and margining characteristics of a DIMM may change over time for several different reasons:

  • Changes in Server memory configuration
  • BIOS changes (Memory Reference Code - MRC)
  • Different operating temperatures of the server or DIMM
  • The general age of the DIMM

Previously, a BIOS update or a detected memory configuration change resulted in memory retraining during the subsequent boot. Starting with BIOS 2.1.x, additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location XX."

 

Any of these errors being logged in the SEL or Lifecycle logs results in memory retraining being scheduled for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated.

Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location DIMM_XX."

 

This multi-bit error may result in the server rebooting due to a fatal error if the Operating System is unable to handle that error. Memory retraining automatically occurs during that boot. If the multi-bit error occurs in a noncritical memory location that the Operating System can handle, a reboot must be scheduled.

Memory retraining during POST may "self-heal" the failing DIMM and associated slot by optimizing the signal timing and margining. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
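
As an illustration only, the retraining triggers described above can be sketched as a small lookup. The log-entry layout used here (a tuple of message ID and DIMM location) and the function name are hypothetical and are not an iDRAC or BIOS interface; only the message IDs come from this article.

```python
# Message IDs this article lists as triggers for scheduled memory retraining
# (BIOS 2.1.x and later). The (message_id, dimm_location) layout is hypothetical.
RETRAIN_TRIGGERS = {"MEM0701", "MEM0702", "MEM0005", "MEM0001"}


def dimms_pending_retraining(entries):
    """Return the sorted DIMM locations that have logged a retraining trigger."""
    return sorted({loc for msg_id, loc in entries if msg_id in RETRAIN_TRIGGERS})


sample = [
    ("MEM0701", "DIMM_A1"),
    ("MEM8000", "DIMM_B2"),  # not listed as a retraining trigger
    ("MEM0001", "DIMM_A3"),
    ("MEM0701", "DIMM_A1"),  # repeated events collapse to one location
]
print(dimms_pending_retraining(sample))  # ['DIMM_A1', 'DIMM_A3']
```

A location returned by this sketch only indicates that a reboot is the suggested first step; whether retraining succeeds is determined by the BIOS during POST.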
 

  2. Post Package Repair (PPR)

The second "self-healing" memory enhancement is PPR. PPR repairs a failing memory location by disabling the location or address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.

Previously, this functionality was limited to the manufacturing process. As with the memory retraining enhancements mentioned earlier, there are certain correctable memory errors that result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has been run. Examples of the errors are:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location XX."

 

Any of these events in the logs results in PPR being scheduled for the next reboot (warm or cold). PPR runs early in the Configuring Memory phase.

Note: A Message ID MEM8000 (Correctable memory error logging disabled for a memory device at location DIMM_XX.), without a corresponding MEM0005/MEM0701/MEM0702 on the same DIMM location, does not result in a PPR being scheduled for the next reboot.

See the July 10, 2020 update later in this article for changes to the MEM8000 event, and see version 1.1 or newer of the white paper.
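
The PPR scheduling rule, including the MEM8000 exception, might be sketched as follows. This is a minimal sketch under assumptions: the entry layout and the `mem8000_schedules_ppr` switch are hypothetical names introduced for illustration. The switch models the behavior that the July 10, 2020 update describes for BIOS 2.7.x and newer.

```python
# Correctable-error events this article lists as scheduling PPR on a DIMM slot.
PPR_TRIGGERS = {"MEM0005", "MEM0701", "MEM0702"}


def slots_with_ppr_scheduled(entries, mem8000_schedules_ppr=False):
    """Return the sorted DIMM slots for which PPR would be scheduled.

    `entries` is an iterable of (message_id, dimm_slot) tuples (hypothetical
    layout). Pass mem8000_schedules_ppr=True to model BIOS 2.7.x and newer,
    where MEM8000 also schedules PPR.
    """
    triggers = set(PPR_TRIGGERS)
    if mem8000_schedules_ppr:
        triggers.add("MEM8000")
    return sorted({slot for msg_id, slot in entries if msg_id in triggers})


log = [
    ("MEM8000", "DIMM_A2"),  # no MEM0005/MEM0701/MEM0702 on this slot
    ("MEM0702", "DIMM_B1"),
    ("MEM8000", "DIMM_B1"),
]
print(slots_with_ppr_scheduled(log))        # ['DIMM_B1']
print(slots_with_ppr_scheduled(log, True))  # ['DIMM_A2', 'DIMM_B1']
```

The slots returned are the ones whose DIMMs should stay in place until the reboot, because the PPR operation is tied to the slot.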

After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation is similar to:

MEM9060 - "The Post Package Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."


A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails. An example of a critical PPR failure message is:

UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
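
The post-reboot check could look roughly like the sketch below. It takes a plain list of message IDs (a hypothetical input, not an iDRAC API) and also accepts MEM0804 and MEM0805, which this article says replace MEM9060 and UEFI0278 in BIOS 2.5.x and newer. Treating a failure message as taking precedence over a success message is a design choice of this sketch, not documented behavior.

```python
# Message IDs this article associates with the result of a PPR operation.
PPR_SUCCESS_IDS = {"MEM9060", "MEM0804"}
PPR_FAILURE_IDS = {"UEFI0278", "MEM0805"}


def ppr_outcome(message_ids):
    """Summarize the PPR result from message IDs logged after the reboot."""
    ids = set(message_ids)
    # Assumption of this sketch: any failure message outweighs a success message.
    if ids & PPR_FAILURE_IDS:
        return "failed: replace the DIMM"
    if ids & PPR_SUCCESS_IDS:
        return "succeeded: no DIMM replacement needed"
    return "no PPR result logged"


print(ppr_outcome(["MEM9060"]))   # succeeded: no DIMM replacement needed
print(ppr_outcome(["UEFI0278"]))  # failed: replace the DIMM
print(ppr_outcome(["MEM0701"]))   # no PPR result logged
```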

 

A white paper (version 1.0) describing the memory-related Reliability, Availability, and Serviceability (RAS) features and capabilities of Dell PowerEdge servers is now available: Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features.

 

Updated April 24, 2020

Dell is continuing to enhance our "self-healing" capabilities. The following section lists the updates and enhancements associated with the different BIOS versions.

BIOS 2.1.x - Initial article publication of the "self-healing" capabilities available starting with BIOS 2.1.6 and higher, including example error messages and recommended actions.

BIOS 2.4.x and newer changes (December 2019)

  • MEM0702 (Correctable error rate exceeded…) - Message severity changed from critical to warning, and the recommended action updated to reboot the server so that "self-healing" (for example, Post Package Repair) can occur.
    • A December 2019 or newer iDRAC must also be installed to get the updated message
    • Recommended Action: Reboot the server to allow PPR to run
  • MEM9060 - Message description updated to indicate "self-healing" was successfully completed

BIOS 2.5.x and newer changes (February 2020)

  • A "Correctable Error Logging" BIOS option was added to allow customers to disable all Lifecycle or SEL logging related to correctable errors. All the "self-healing" features continue to function; for example, PPR and memory retraining are still scheduled and run during the next reboot (early in the Configuring Memory process).
  • Addition of MEM08xx errors for RDIMMs and LRDIMMs replacing existing error messages and actions. Existing error messages are still used for platforms that do not support the "self-healing" capabilities.
    • February 2020 or newer iDRAC is required for the new messages to be logged.
Note: Without the updated iDRAC, new BIOS messages are "unknown" in the SEL or Lifecycle logs.
  • MEM0802 - Replaced MEM0702 - correctable error rate exceeded
    • Recommended Action: Reboot the server to allow PPR to run. Confirm that PPR was successful (MEM0804)
  • MEM0804 - Replaced MEM9060 indicating PPR was successful. Now includes DIMM slot location that ran PPR
    • Recommended Action: None, this event indicates "self-healing" occurred, no DIMM replacement is needed.
  • MEM0805 - Replaced UEFI0278 indicating PPR failed
    • Recommended Action: Replace failing DIMM
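
For scripts that must handle logs from both older and newer BIOS versions, the replacements listed above can be captured in a lookup table. This is a sketch only: the dictionary and function names are hypothetical, and the action strings paraphrase the recommended actions in this article.

```python
# Message IDs replaced in BIOS 2.5.x and newer (old ID -> new ID), per this article.
REPLACED_BY = {
    "MEM0702": "MEM0802",   # correctable error rate exceeded
    "MEM9060": "MEM0804",   # PPR was successful
    "UEFI0278": "MEM0805",  # PPR failed
}

RECOMMENDED_ACTION = {
    "MEM0802": "Reboot the server to allow PPR to run, then confirm PPR succeeded.",
    "MEM0804": "None; self-healing occurred and no DIMM replacement is needed.",
    "MEM0805": "Replace the failing DIMM.",
}


def recommended_action(message_id):
    """Map an old or new message ID to its action; None if the ID is not covered."""
    current_id = REPLACED_BY.get(message_id, message_id)
    return RECOMMENDED_ACTION.get(current_id)


print(recommended_action("UEFI0278"))  # Replace the failing DIMM.
print(recommended_action("MEM0804"))
```

Returning None for unlisted IDs keeps the sketch from guessing at actions this article does not state.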

Updated July 10, 2020

BIOS 2.7.x and newer changes (July 2020 block BIOS - targeted mid-July for web posting)

  • MEM8000 (Correctable error logging disabled) - Starting with BIOS ~2.0.x, Dell Engineering made a BIOS change to enhance the rate of correctable error detection that may impact performance. This change resulted in an uptick in MEM8000 events that were not substantiated by the results of DIMM failure analysis. Starting with BIOS 2.7.x, there are two changes related to MEM8000. First, signaling of the MEM8000 event has been modified. Second, BIOS schedules self-healing (PPR) for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
    • Recommended Action: Reboot the server to allow self-healing/PPR to run. Confirm that PPR was successful (MEM0804).
  • MEM0001 (Uncorrectable error) - Results in self-healing (PPR) being scheduled for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
    • Recommended Action: None needed if the MEM0001 is associated with a critical page that the Operating System is unable to recover; this is still a fatal error resulting in a reboot. If the MEM0001 is associated with a noncritical page that the Operating System can recover from, a reboot must be scheduled to allow self-healing (PPR) to occur. Confirm that PPR was successful (MEM0804).

Updated January 13, 2021

BIOS 2.8.2 and newer changes (September 2020 block BIOS)

  • MEM9072 (Uncorrectable error identified by the memory patrol scrub process; the page is not consumed or in use) - Results in self-healing (PPR) being scheduled for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
    • Recommended Action: Schedule a reboot soon. Delaying the reboot could result in the page being consumed, producing a MEM0001 error that may force a reboot. Memory self-healing (PPR) runs during that reboot. Confirm that PPR was successful (MEM0804).
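
The version-by-version changes above can be summarized in a small sketch that returns which events schedule PPR for a given BIOS version. The thresholds are a simplification: "2.1.x", "2.5.x", and "2.7.x" are treated as major.minor minimums, the table and function names are hypothetical, and real logging also depends on the installed iDRAC version.

```python
def parse_bios_version(text):
    """Turn a dotted version string such as '2.8.2' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# (minimum BIOS version, events that start scheduling PPR at that version)
PPR_EVENTS_BY_BIOS = [
    ((2, 1), {"MEM0005", "MEM0701", "MEM0702"}),
    ((2, 5), {"MEM0802"}),             # replaces MEM0702 where supported
    ((2, 7), {"MEM8000", "MEM0001"}),
    ((2, 8, 2), {"MEM9072"}),
]


def ppr_scheduling_events(bios_version):
    """Return the set of message IDs that schedule PPR on the given BIOS version."""
    version = parse_bios_version(bios_version)
    events = set()
    for minimum, added in PPR_EVENTS_BY_BIOS:
        if version >= minimum:  # tuple comparison, element by element
            events |= added
    return events


print(sorted(ppr_scheduling_events("2.4.8")))  # ['MEM0005', 'MEM0701', 'MEM0702']
print("MEM9072" in ppr_scheduling_events("2.8.2"))  # True
```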
Note: The latest version of the Engineering white paper (version 1.3 - issue date November 20, 2020) is found at:  https://downloads.dell.com/manuals/common/dellemc_poweredge_yx4x_memoryras.pdf
For Intel Xeon E and AMD EPYC content, continue to reference the original Engineering white paper (version 1.0) which is found at: PowerEdge YX4X Server Memory RAS Whitepaper v1.0 (dell.com)

There are additional RAS feature enhancements being evaluated for inclusion in future BIOS updates.

 
Note: For detailed description and recommended actions for specific error code messages, reference the following link: Look Up (dell.com). Since error codes (such as MEM0001) apply to multiple generations of servers and platforms, the recommended actions may not be current for the particular BIOS version. The new error codes that have been added (such as MEM0802, MEM0804, MEM0805, and so on) only apply to Servers with Intel Xeon Scalable Processors (first or second generation).

 

This article is updated as new information becomes available.



Affected Products

Dell EMC XC Series XC6420 Appliance, Dell EMC XC Core 6420 System, Storage Spaces Direct R440 Ready Node, Storage Spaces Direct R640 Ready Node, Storage Spaces Direct R740xd Ready Node, Storage Spaces Direct R740xd2 Ready node, OEMR R240, OEMR R250 , OEMR XE R250, OEMR R260, OEMR XE R260, OEMR R340, OEMR R350, OEMR XE R350, OEMR R360, OEMR XE R360, OEMR R440, PowerEdge XR2, OEMR R450, OEMR R540, OEMR R550, OEMR R5500, OEMR R640, OEMR XL R640, OEMR R650, OEMR R650xs, OEMR R660, OEMR XL R660, OEMR R660xs, OEMR R740, OEMR XL R740, OEMR R740xd, OEMR XL R740xd, OEMR R740xd2, OEMR R750, OEMR R750xa, OEMR R750xs, OEMR R760, OEMR R760xa, OEMR R760XD2, OEMR XL R760, OEMR R760xs, OEMR R840, OEMR R860, OEMR R940, OEMR R940xa, OEMR R960, OEMR T140, OEMR T150, OEMR T340, OEMR T350, OEMR T360, OEMR T440, OEMR T550, OEMR T560, OEMR T640, OEMR XL T640, OEMR XL R240, OEMR XL R340, OEMR XL R660xs, OEMR XR11, OEMR XR12, OEMR XR4000r, OEMR XR4000w, OEMR XR4510c, OEMR XR4520c, OEMR XR5610, OEMR XR7620, OEMR XR8610t, OEMR XR8620t, Poweredge C4140, PowerEdge C6420, PowerEdge C6520, PowerEdge C6525, PowerEdge C6615, PowerEdge C6620, PowerEdge FC640, PowerEdge HS5610, PowerEdge HS5620, PowerEdge M640, PowerEdge MX740C, PowerEdge MX750c, PowerEdge MX760c, PowerEdge MX840C, PowerEdge R240, PowerEdge R250, PowerEdge R260, PowerEdge R340, PowerEdge R350, PowerEdge R360, PowerEdge R440, PowerEdge R450, PowerEdge R540, PowerEdge R550, PowerEdge R640, PowerEdge R650, PowerEdge R650xs, PowerEdge R660, PowerEdge R660xs, PowerEdge R740, PowerEdge R740XD, PowerEdge R740XD2, PowerEdge R750, PowerEdge R750XA, PowerEdge R750xs, PowerEdge R760, PowerEdge R760XA, PowerEdge R760xd2, PowerEdge R760xs, PowerEdge R840, PowerEdge R860, PowerEdge R940, PowerEdge R940xa, PowerEdge R960, PowerEdge T140, PowerEdge T150, PowerEdge T160, PowerEdge T340, PowerEdge T350, PowerEdge T360, PowerEdge T440, PowerEdge T550, PowerEdge T560, PowerEdge T640, PowerEdge XE2420, PowerEdge XE7100, PowerEdge XE7420, PowerEdge 
XE7440, PowerEdge XE8640, PowerEdge XE9640, PowerEdge XE9680, PowerEdge XE9680L, PowerEdge XR11, PowerEdge XR12, PowerEdge XR4000r, PowerEdge XR4000w, PowerEdge XR4510c, PowerEdge XR4520c, PowerEdge XR5610, PowerEdge XR7620, PowerEdge XR8610t, PowerEdge XR8620t, PowerFlex appliance R650, PowerFlex appliance R660, Powerflex appliance R750, PowerFlex appliance R760, PowerFlex custom node R650, PowerFlex custom node R660, PowerFlex custom node R750, PowerFlex custom node R760, PowerFlex custom node R860, VxFlex Ready Node R640, VxFlex Ready Node R740xd, Dell EMC vSAN C6420 Ready Node, Dell EMC vSAN MX740c Ready Node, Dell EMC vSAN MX750c Ready Node, Dell vSAN Ready Node MX760c, Dell EMC vSAN R440 Ready Node, Dell EMC vSAN R640 Ready Node, Dell EMC vSAN R650 Ready Node, vSAN Ready Node R660, Dell EMC vSAN R740 Ready Node, Dell EMC vSAN R740xd Ready Node, Dell EMC vSAN R750 Ready Node, Dell EMC vSAN R760 Ready Node, Dell EMC vSAN R840 Ready Node, Dell EMC vSAN T350 Ready Node, PowerFlex appliance R640, PowerFlex appliance R740XD, PowerFlex appliance R840, VxFlex Ready Node R840, VxRail 460 and 470 Nodes, VxRail E560F, VxRail P570, VxRail P570F, VxRail S570, VxRail V570F ...
Article Properties
Article Number: 000053203
Article Type: Solution
Last Modified: 25 Nov 2025
Version:  26