14G Intel and 15G Intel / AMD PowerEdge servers: DDR4 memory: managing Correctable error threshold events

Summary: Updated recommendations for customers managing correctable error threshold events (MEM0802 or MEM5104) on DDR4 RDIMMs or LRDIMMs installed in Intel-based 14G and 15G PowerEdge servers and in AMD-based 15G PowerEdge servers. Note: This article does not apply to 14G AMD-based PowerEdge servers, such as the 64x5 or 74x5 platforms, because they do not have this Post Package Repair / self-healing capability, even though the DIMM itself supports it. ...

Article Content


Symptoms

Through the evolution of RAS (Reliability, Availability, and Serviceability) features across enterprise-class memory, Dell has taken a conservative approach to error reporting to provide transparency to our customers. As this evolution continues, Dell’s approach to error reporting evolves with it, so that customers can focus on notices that require a more urgent response rather than notices that are primarily informational.

As DRAM-based memory geometries continue to shrink, providing customers with the increased performance they demand, an increasing number of correctable errors is expected as a natural part of uniform scaling.

Cause

Within the global server industry, there is an increasingly accepted understanding, shared by Dell, that some correctable errors per DIMM are unavoidable and do not inherently warrant a memory module replacement or even an immediate reboot to initiate self-healing.

Resolution

Continuing to operate a system that is reporting correctable errors, without a reboot to self-heal, does not significantly increase the risk of uncorrectable errors that may lead to unplanned downtime. In fact, others in the industry have publicly stated that their memory handling does not report correctable errors.

In 14G Intel PowerEdge BIOS version 2.5.4 and newer, a BIOS setting called "Correctable Error Logging" was added to give customers the option to disable correctable error reporting, and many have done so. The BIOS continues to schedule self-healing for correctable threshold events even with logging disabled. This scheduled self-healing occurs automatically during the next system reboot.

To align more closely with industry practice and continuing customer feedback, beginning in March 2022, Dell PowerEdge BIOS updates change the "Correctable Error Logging" BIOS setting to Disabled by default. Customers who want to continue to see correctable memory threshold events can re-enable this BIOS option. The BIOS versions that include this default change are:
  • 14G Intel Platforms - BIOS versions 2.13.3 or newer
  • 15G AMD Platforms - BIOS versions 2.6.5 or newer
  • 15G Intel Platforms - BIOS versions 1.5.5 or newer
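
As an illustration only, the version thresholds above can be checked with a short Python sketch. The platform labels used as dictionary keys and the function names are assumptions made for this example, not Dell identifiers, and the sketch assumes BIOS versions are plain dotted numbers such as 2.13.3.

```python
from typing import Tuple

# Minimum BIOS versions that ship with "Correctable Error Logging" disabled
# by default, copied from the list above. The keys are labels chosen for
# this sketch, not Dell identifiers.
MIN_DEFAULT_DISABLED = {
    "14G Intel": (2, 13, 3),
    "15G AMD": (2, 6, 5),
    "15G Intel": (1, 5, 5),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.13.3' into (2, 13, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def logging_disabled_by_default(platform: str, bios_version: str) -> bool:
    """Return True if this BIOS version includes the default change."""
    if platform not in MIN_DEFAULT_DISABLED:
        # For example 14G AMD, which this article does not cover.
        raise ValueError(f"platform not covered by this article: {platform}")
    # Tuples compare field by field, so 2.5.4 sorts before 2.13.3.
    return parse_version(bios_version) >= MIN_DEFAULT_DISABLED[platform]
```

Comparing the versions as integer tuples rather than as strings matters here: as text, "2.5.4" would sort after "2.13.3", which would wrongly treat the older 14G Intel BIOS as already including the change.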

The benefits of DDR4 DIMM self-heal via a system reboot:
  • Enables repair of a DDR4 DIMM without removal from the system; all Dell-sourced DDR4 DIMMs support memory self-heal. Note - 14G AMD PowerEdge servers do not have this self-healing capability.
  • Uses the spare rows built into the DRAM: a bad row is permanently replaced with a known good row by electrical fusing.
  • The subsequent memory retrain optimizes the “data eyes” by recalibrating the center points to ensure that the memory bus operates at the highest level of signaling integrity.


With the "Correctable Error Logging" BIOS setting Enabled, if correctable memory threshold events occur, Dell recommends rebooting at the customer’s regular maintenance schedule so that the scheduled memory self-healing or self-correcting can run. After the reboot, successful or unsuccessful self-healing events are logged for the associated DIMMs.

With the "Correctable Error Logging" BIOS setting Disabled, Dell recommends rebooting at the customer’s regular maintenance schedule. Upon reboot, any scheduled self-healing operations run automatically. If a self-healing / self-correction operation is unsuccessful, the system logs an event (MEM0805 or MEM7114 type) that also recommends physically replacing the affected DIMM.
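
The handling described above can be summarized in a small Python sketch that maps the event IDs named in this article to the recommended response. This is a simplified illustration: the set names, the function name, and the wording of the returned strings are assumptions for this example, and real log entries may need additional parsing before the ID can be extracted.

```python
# Event IDs named in this article. The action strings paraphrase the
# article's guidance; the set and function names are illustrative only.
CORRECTABLE_THRESHOLD_EVENTS = {"MEM0802", "MEM5104"}
SELF_HEAL_FAILED_EVENTS = {"MEM0805", "MEM7114"}


def recommended_action(event_id: str) -> str:
    """Map a memory event ID to the response this article recommends."""
    event = event_id.strip().upper()
    if event in SELF_HEAL_FAILED_EVENTS:
        # Self-healing ran at reboot and did not succeed.
        return "replace DIMM"
    if event in CORRECTABLE_THRESHOLD_EVENTS:
        # Self-healing is scheduled and runs at the next reboot.
        return "reboot at next scheduled maintenance"
    return "not covered by this article"
```

Checking the failed self-heal IDs first keeps the more urgent outcome (a DIMM replacement) from being masked if the sets are ever extended.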

Recommendation:
Dell EMC Memory Engineering recommends that PowerEdge server customers on older BIOS versions (releases prior to the March 2022 BIOS block) change the "Correctable Error Logging" BIOS setting to Disabled. This eliminates the sporadic correctable memory threshold events (such as MEM0802 or MEM5104 type events) across their server infrastructure that recommend server reboots to allow self-healing or self-correction to occur. As mentioned previously, any scheduled self-healing or self-correction operations run automatically when the server is rebooted, and any failures are reported.
 

The "Correctable Error Logging" BIOS setting can be changed either by rebooting the server to F2 Settings or via the iDRAC GUI.
 

To change the BIOS setting using F2 Settings:

  • Reboot the server and stop at F2 settings
  • Under BIOS Settings -> Memory Settings, change the "Correctable Error Logging" setting to Disabled
  • Save the BIOS settings and exit F2 settings

To change the BIOS setting using the iDRAC GUI:

  • Log into the iDRAC GUI
  • Under Configuration -> BIOS Settings, expand the Memory Settings section
  • Change the "Correctable Error Logging" setting to disabled
  • Click the Apply button to save the Memory Settings
  • Select either the Apply and Reboot button (to reboot immediately) or the At Next Reboot button to apply the BIOS changes


Existing memory related KB articles and whitepapers will be updated to reflect this recommended change.

NOTE: The approved customer-facing messaging is attached as a file to this article - "Managing Correctable Error Notices Dec 2021 v1.pdf".

This article will be updated as new information becomes available.

Article Properties


Affected Product
AX-6515, AX-7525, Dell EMC vSAN C6420 Ready Node, Dell EMC vSAN MX740c Ready Node, Dell EMC vSAN R440 Ready Node, Dell EMC vSAN R640 Ready Node, Dell EMC vSAN R650 Ready Node, Dell EMC vSAN R6515 Ready Node, Dell EMC vSAN R740 Ready Node, Dell EMC vSAN R740xd Ready Node ...
Product
Dell EMC XC Series XC6420 Appliance, Dell EMC XC Core 6420 System, Storage Spaces Direct R440 Ready Node, Storage Spaces Direct R640 Ready Node, Storage Spaces Direct R740xd Ready Node, Storage Spaces Direct R740xd2 Ready node, OEMR R340, OEMR R440 , PowerEdge XR2, OEMR R540, OEMR R640, OEMR XL R640, OEMR R650, OEMR R650xs, OEMR R6515, OEMR R6525, OEMR R740, OEMR XL R740, OEMR R740xd, OEMR XL R740xd, OEMR R740xd2, OEMR R750, OEMR R750xa, OEMR R750xs, OEMR R7515, OEMR R7525, OEMR R840, OEMR R940, OEMR R940xa, OEMR T440, OEMR T550, OEMR T640, OEMR XL T640, OEMR XL R340, PowerEdge C6420, PowerEdge C6525, PowerEdge MX740C, PowerEdge MX840C, PowerEdge R340, PowerEdge R540, PowerEdge R640, PowerEdge R650, PowerEdge R650xs, PowerEdge R6515, PowerEdge R6525, PowerEdge R740, PowerEdge R740XD, PowerEdge R740XD2, PowerEdge R750, PowerEdge R750XA, PowerEdge R750xs, PowerEdge R7515, PowerEdge R7525, PowerEdge R840, PowerEdge R940, PowerEdge R940xa, PowerEdge T440, PowerEdge T550, PowerEdge T640, PowerFlex appliance R650, PowerFlex appliance R6525, Powerflex appliance R750, PowerFlex custom node R650, PowerFlex custom node R6525, PowerFlex custom node R750, VxFlex Ready Node R640, VxFlex Ready Node R740xd, Dell EMC vSAN R750 Ready Node, Dell EMC vSAN R7515 Ready Node, Dell EMC vSAN R840 Ready Node, PowerFlex appliance R640, PowerFlex appliance R740XD, PowerFlex appliance R840, VxFlex Ready Node R840, Dell EMC XC Core XC7525 ...
Last Published Date

10 Feb 2022

Version

2

Article Type

Solution