NVDIMM-N: What to know about error message differences and "self healing" capabilities

Summary: This article describes the different error messages reported for NVDIMM-N modules, their "self healing" capabilities, and how both differ from RDIMMs and LRDIMMs.

This article is not tied to any specific product. Not all product versions are identified in this article.

Instructions

NVDIMM-N:


Does NVDIMM-N support "self healing" capabilities provided for standard RDIMM/LRDIMM memory as part of BIOS 2.1.8 and newer?
What are some of the differences in error logging behavior as the result of BIOS changes?
What does Technical Support recommend, and what actions should be taken for the different error messages?
What are some key differences between RDIMM/LRDIMM and NVDIMM-N modules?


 


Self Healing Capabilities

Post Package Repair (PPR) - Unlike standard RDIMMs/LRDIMMs, NVDIMM-N memory modules do not support PPR functionality.
Memory Retraining - NVDIMM-N memory modules do support memory retraining. Memory retraining scheduling is not specific to a DIMM slot location but applies to all devices plugged into the memory bus. All the same triggers as for RDIMMs/LRDIMMs therefore apply: memory errors, configuration changes, and so forth.
Persistent memory scrubbing helps identify multi-bit or uncorrectable errors on NVDIMM-N, mitigating future failures, though it is not a self-healing capability.
BIOS logs the error and provides the memory location to the OS, which adds it to a blacklist of bad memory locations to avoid. When these memory locations are not "consumed" or in use by the OS, these are not critical errors and are not fatal.

There are three BIOS settings for Persistent Memory scrubbing:

Auto: The system automatically scrubs persistent memory during POST when multi-bit errors have been detected. This is a background operation.
One Shot: The system will scrub persistent memory during POST on the entire persistent memory range once. Upon the next boot, the system will go back to "Auto" persistent memory scrub mode.
Enable: The system will scrub persistent memory during POST on the entire persistent memory range on every boot.

 

Note: Scrubbing persistent memory (One Shot or Enable) may take over 60 minutes during POST, depending on memory population, before booting the OS.
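
The three scrub settings can be summarized as a small state model. The sketch below is illustrative only: the names ScrubMode, scrubs_during_post, mode_after_boot, and simulate_boots are hypothetical and do not correspond to any BIOS or Dell interface. It assumes only what is described above, namely that Auto scrubs when multi-bit errors have been detected, that One Shot scrubs once and then reverts to Auto, and that Enable scrubs on every boot.

```python
from enum import Enum


class ScrubMode(Enum):
    """Hypothetical labels for the three BIOS persistent memory scrub settings."""
    AUTO = "Auto"
    ONE_SHOT = "One Shot"
    ENABLE = "Enable"


def scrubs_during_post(mode, multibit_errors_detected):
    """Return True if a persistent memory scrub runs during this POST."""
    if mode is ScrubMode.AUTO:
        # Auto scrubs only when multi-bit errors have been detected.
        return multibit_errors_detected
    # One Shot and Enable both scrub the entire persistent memory range.
    return True


def mode_after_boot(mode):
    """Return the setting that is in effect on the next boot."""
    if mode is ScrubMode.ONE_SHOT:
        # One Shot applies to a single boot, then reverts to Auto.
        return ScrubMode.AUTO
    return mode


def simulate_boots(start_mode, error_flags):
    """Walk through successive boots; error_flags has one boolean per boot."""
    mode = start_mode
    history = []
    for errors in error_flags:
        history.append((mode.value, scrubs_during_post(mode, errors)))
        mode = mode_after_boot(mode)
    return history


if __name__ == "__main__":
    for entry in simulate_boots(ScrubMode.ONE_SHOT, [False, False, True]):
        print(entry)
```

In this model, a system set to One Shot scrubs on the first boot regardless of errors, then behaves as Auto on later boots.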



Possible NVDIMM-N Persistent Memory scrubbing error messages and actions or recommendations

 

 

Note: Any VxFlex OS I/O errors that occur as a result of any of the following LifeCycle/SEL reported errors may require VxFlex-specific recovery actions.

 

 

Note: Do not automatically replace the NVDIMM-N modules when encountering any of the following errors. The first step is to review the LifeCycle/SEL logs to determine what actions or events may have led up to the error being logged. These errors can be false when recommended actions were not followed, for example after a firmware downgrade or after moving NVDIMM-N modules.


See examples in the section below on Key differences between standard RDIMM/LRDIMM and NVDIMM-N modules.

 


MEM0001 - All BIOS revisions

Multi bit memory errors detected on a memory device at location arg1
Persistent Memory scrub identified a multi-bit/uncorrectable error on a consumed (in-use) memory page.
Information:  The bad page or location is added to the bad list in the MB NVRAM for that slot. During POST, depending on the server's BIOS version, MEM0702/MEM9072/MEM9022 errors may be reported as the page or location is not yet consumed.
Recommendation:  Replace NVDIMM-N referenced.

 

MEM0702 - Prior to BIOS 2.5.4 (February 2020):

Actual message: Correctable memory error rate exceeded for arg1
Secondary meaning for NVDIMM-N: Persistent Memory scrub identified an uncorrectable error on a non-consumed memory page or location that is not in use. That memory page or location has been provided to the OS "black list" to not be used.
Information: If this error is due to a bad page or location that persistent memory scrubbing identifies, this error occurs during POST upon every reboot. To determine if the error is due to a correctable error rate or a bad page identified by memory scrub, check the SPD data...
Recommendation:  Replace NVDIMM-N referenced.

 

MEM9072 - BIOS 2.5.4 (February 2020) through BIOS 2.6.4 (May 2020):

Actual message: The system memory has faced an uncorrectable multi-bit memory error in the non-execution path of a memory device at the location arg1.
Secondary meaning for NVDIMM-N: Persistent Memory scrub identified an uncorrectable error on a non-consumed (non-execution path) memory page or location that is not in use. That memory page or location has been provided to the OS "black list" to not be used.
Information: Unlike MEM0001 errors, this uncorrectable/multi-bit error is not a "fatal" error and does not result in a server reboot when it occurs.
Recommendation: No action is necessary. The bad memory page or location has been provided to the OS "black list" and will not be used. This error message occurs during POST upon every reboot.

 

Note: BIOS version 2.7.x will not report memory scrubbing-related errors (MEM0702 or MEM9072), but the associated bad page or location will still be added to the bad list (maintained in the MB NVRAM) that is associated with that slot location.

 

MEM9022 - Newer than BIOS 2.7.x (Post July 2020) - planned changes:

Actual message: A Non-Critical event was detected on the Non-Volatile Dual In-line memory module (NVDIMM) device in the slot arg1.
Persistent Memory scrub identified an uncorrectable error on a non-consumed memory page or location that is not in use. That memory page or location has been provided to the OS "black list" to not be used.
Recommendation:  No action is necessary. The bad memory page or location has been provided to the OS "black list" and will not be used.
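
As a quick reference, the BIOS version ranges and recommendations above can be expressed as a lookup. This is a minimal sketch, not a Dell tool: the function and dictionary names are hypothetical, BIOS versions are assumed to be dotted numeric strings such as 2.5.4, and versions between 2.6.4 and 2.7.0 are treated as unknown because this article does not cover them. The MEM9022 behavior is described above as a planned change.

```python
from typing import Optional, Tuple

# Recommendations as stated in this article. Review the LifeCycle/SEL logs
# before replacing any module, because these errors can be false.
RECOMMENDATIONS = {
    "MEM0001": "Replace the referenced NVDIMM-N.",
    "MEM0702": "Replace the referenced NVDIMM-N.",
    "MEM9072": "No action is necessary.",
    "MEM9022": "No action is necessary.",
}


def parse_bios_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.5.4' into (2, 5, 4)."""
    return tuple(int(part) for part in text.strip().split("."))


def non_consumed_scrub_message(bios_version: str) -> Optional[str]:
    """Return the message ID reported when persistent memory scrub finds an
    uncorrectable error on a non-consumed page, or None if the BIOS does
    not report it. Errors on consumed pages are MEM0001 on all revisions."""
    version = parse_bios_version(bios_version)
    if version < (2, 5, 4):
        return "MEM0702"
    if version <= (2, 6, 4):
        return "MEM9072"
    if version[:2] == (2, 7):
        # 2.7.x does not report the error but still updates the bad list.
        return None
    if version[:2] > (2, 7):
        return "MEM9022"
    raise ValueError("BIOS %s is not covered by this article" % bios_version)


if __name__ == "__main__":
    for bios in ("2.4.8", "2.5.4", "2.7.7", "2.8.1"):
        message = non_consumed_scrub_message(bios)
        print(bios, message, RECOMMENDATIONS.get(message, "Nothing is reported."))
```

The BIOS version strings in the usage loop are arbitrary sample inputs chosen to fall in each range, not a list of released BIOS versions.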

Key differences between standard RDIMM/LRDIMM and NVDIMM-N modules

Do not move NVDIMM-N modules between server types, for example from an AMD-based server to an Intel-based server.
The existing persistent data on the NVDIMM-N module may no longer be accessible.
There are differences in CRC algorithms on different system types that result in unexpected errors (MEM0001, MEM0702, MEM9072, or MEM9022).
When moving an NVDIMM-N module to a different system, sanitize it in the new system to ensure it functions as expected. Sanitizing an NVDIMM-N module erases any data on it.
Do not move NVDIMM-N modules from one slot location to another, for example for troubleshooting. NVDIMM-N modules are configured in the OS in either a stand-alone or interleaving configuration, on a per slot basis. Physically relocating the NVDIMM-N modules may result in data loss as the NVDIMM-N module in a given slot location no longer matches the current OS configuration.
If there is no valid data on the NVDIMM-N and NVDIMM-N modules must be moved to a different slot (that is, swapped for troubleshooting):
Be sure to sanitize (erase) the associated NVDIMM-N modules. If persistent memory scrubbing errors are logged during POST on a particular slot (when an existing "bad" list is provided to the OS), those errors continue on that slot even if the affected NVDIMM-N module is swapped or moved to a different slot. The sanitize operation, in addition to erasing the data on the NVDIMM-N module, clears the "bad" list held in the MB NVRAM for that slot.
Configure the NVDIMM-N modules as needed in the OS and restore the customer data.
NVDIMM-N modules contain firmware which can introduce behavior dependency issues when changing firmware versions. If an NVDIMM-N module's firmware is downgraded from its original version, it must be sanitized before use. Failing to do so will likely result in "false" errors (MEM0702, MEM9072, or MEM9022) being reported by the persistent memory scrubbing. 
In a recent case, multiple servers had their NVDIMM-N firmware downgraded from version 9772 to 9324 without sanitizing the modules afterward. These servers reported MEM0702 errors (either during or shortly after POST) across many of the NVDIMM-N modules. Sanitizing (erasing) the NVDIMM-N modules resolved these "false" persistent memory scrubbing errors.
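
The handling rules in this section can be condensed into a simple checklist. The sketch below is a hypothetical helper, not part of any Dell utility: the ModuleHistory fields and the sanitize_reasons function are invented for illustration, and firmware versions are assumed to be comparable as plain integers (as with 9772 and 9324 in the case above). Because sanitizing erases the module, any valid data must be backed up first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModuleHistory:
    """Hypothetical record of what has happened to one NVDIMM-N module."""
    moved_to_different_system: bool = False
    moved_to_different_slot: bool = False
    original_firmware: int = 0
    current_firmware: int = 0


def sanitize_reasons(history: ModuleHistory) -> List[str]:
    """List the reasons, per this article, why the module should be
    sanitized before use. An empty list means none of the rules apply."""
    reasons = []
    if history.moved_to_different_system:
        # CRC algorithm differences between system types can cause errors.
        reasons.append("moved to a different system")
    if history.moved_to_different_slot:
        # The bad list in MB NVRAM is tied to the slot, not to the module.
        reasons.append("moved to a different slot")
    if history.current_firmware < history.original_firmware:
        # A downgrade without sanitizing can produce false scrub errors.
        reasons.append("firmware downgraded")
    return reasons


if __name__ == "__main__":
    downgraded = ModuleHistory(original_firmware=9772, current_firmware=9324)
    print(sanitize_reasons(downgraded))
```

An empty result only means that none of the conditions named in this article apply; it does not rule out a genuine module fault.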

 

 

Note: Make sure to update the BIOS to the latest version available on the Dell support site.

 

More information about NVDIMM-N memory can be found in the Dell EMC NVDIMM-N Persistent Memory User Guide that is available in the Manuals and Documents tab for the platform found at: https://www.dell.com/support/home

Affected Products

VxFlex Ready Nodes, PowerFlex Appliance, OEMR R340, OEMR R440, PowerEdge XR2, OEMR R540, OEMR R640, OEMR XL R640, OEMR R740, OEMR XL R740, OEMR R840, OEMR R940, OEMR T140, OEMR T340, OEMR T440, OEMR T640, OEMR XL T640, OEMR XL R240, OEMR XL R340 , PowerEdge FC640, PowerEdge M640, PowerEdge M640 (for PE VRTX), PowerEdge MX5016s, PowerEdge MX7000, PowerEdge MX740C, PowerEdge MX840C, PowerEdge R240, PowerEdge R340, PowerEdge R440, PowerEdge R540, PowerEdge R640, PowerEdge R6415, PowerEdge R740, PowerEdge R740XD, PowerEdge R740XD2, PowerEdge R7415, PowerEdge R7425, PowerEdge R840, PowerEdge R940, PowerEdge R940xa, PowerEdge T140, PowerEdge T340, PowerEdge T40, PowerEdge T440, PowerEdge T640, PowerEdge XE2420, PowerEdge XE7100, PowerEdge XE7420, PowerEdge XE7440, VxFlex Ready Node ...
Article Properties
Article Number: 000052811
Article Type: How To
Last Modified: 19 Nov 2024
Version:  6