NVDIMM-N: What to know about error message differences and "self healing" capabilities
Summary: This article describes the error messages that NVDIMM-N modules can report, their "self healing" capabilities, and how both differ from RDIMMs and LRDIMMs.
Instructions
NVDIMM-N:
Does NVDIMM-N support "self healing" capabilities provided for standard RDIMM/LRDIMM memory as part of BIOS 2.1.8 and newer?
What are some of the differences in error logging behavior as the result of BIOS changes?
What does Technical Support recommend, and what actions should be taken for the different error messages?
What are some key differences between RDIMM/LRDIMM and NVDIMM-N modules?
Self Healing Capabilities
Post Package Repair (PPR) - NVDIMM-N memory modules do not support PPR functionality like standard RDIMM/LRDIMMs.
Memory Retraining - NVDIMM-N memory modules do support memory retraining. Memory retraining scheduling is not specific to a DIMM slot location but applies to all devices plugged into the memory bus, so the same triggers as for RDIMMs/LRDIMMs apply: memory errors, configuration changes, and so forth.
Persistent memory scrubbing helps identify multi-bit or uncorrectable errors on NVDIMM-N, mitigating future failures, though it is not a self-healing capability.
BIOS logs the error and provides the memory location to the OS, which adds it to a blacklist of bad memory locations to avoid. When these memory locations are not "consumed" (in use by the OS), the errors are not critical and are not fatal.
There are three BIOS settings for Persistent Memory scrubbing:
Auto: The system automatically scrubs persistent memory during POST when multi-bit errors have been detected.
This is a background operation.
One Shot: The system will scrub persistent memory during POST on the entire persistent memory range once. Upon the next boot, the system will go back to "Auto" persistent memory scrub mode.
Enable: The system will scrub persistent memory during POST on the entire persistent memory range on every boot.
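The three scrub settings above can be modeled as a small state sketch. This is only an illustration of the behavior described in this article, not BIOS code; the names ScrubMode, scrub_runs_during_post, and mode_after_boot are hypothetical.

```python
from enum import Enum


class ScrubMode(Enum):
    """Persistent Memory scrub settings as named in this article."""
    AUTO = "Auto"
    ONE_SHOT = "One Shot"
    ENABLE = "Enable"


def scrub_runs_during_post(mode: ScrubMode, multibit_errors_detected: bool) -> bool:
    """Return True if persistent memory is scrubbed during POST on this boot."""
    if mode is ScrubMode.AUTO:
        # Auto scrubs only when multi-bit errors have been detected.
        return multibit_errors_detected
    # One Shot and Enable both scrub the entire persistent memory range.
    return True


def mode_after_boot(mode: ScrubMode) -> ScrubMode:
    """Return the scrub mode in effect on the next boot."""
    # One Shot reverts to Auto after its single full scrub; other modes persist.
    return ScrubMode.AUTO if mode is ScrubMode.ONE_SHOT else mode
```

For example, a system set to One Shot scrubs the full range once and then behaves as Auto on the following boot, while Enable repeats the full scrub on every boot.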
Possible NVDIMM-N Persistent Memory scrubbing error messages and actions or recommendations
See examples in the section below on Key differences between standard RDIMM/LRDIMM and NVDIMM-N modules.
MEM0001 - All BIOS revisions
Multi bit memory errors detected on a memory device at location arg1
Persistent Memory scrub identified a multi-bit/uncorrectable error on a consumed (in-use) memory page.
Information: The bad page or location is added to the bad list in the MB NVRAM for that slot. During POST, depending on the server's BIOS version, MEM0702/MEM9072/MEM9022 errors may be reported as the page or location is not yet consumed.
Recommendation: Replace NVDIMM-N referenced.
MEM0702 - Prior to BIOS 2.5.4 (February 2020):
Actual message: Correctable memory error rate exceeded for arg1
Secondary meaning for NVDIMM-N: Persistent Memory scrub identified an uncorrectable error on a non-consumed memory page or location that is not in use. That memory page or location has been provided to the OS "black list" to not be used.
Information: If this error is due to a bad page or location that persistent memory scrubbing identifies, this error occurs during POST upon every reboot. To determine if the error is due to a correctable error rate or a bad page identified by memory scrub, check the SPD data...
Recommendation: Replace NVDIMM-N referenced.
MEM9072 - BIOS 2.5.4 (February 2020) through BIOS 2.6.4 (May 2020):
Actual message: The system memory has faced an uncorrectable multi-bit memory error in the non-execution path of a memory device at the location arg1.
Secondary meaning for NVDIMM-N: Persistent Memory scrub identified an uncorrectable error on a non-consumed (non-execution path) memory page or location that is not in use. That memory page or location has been provided to the OS "black list" to not be used.
Information: Unlike a MEM0001 error, this uncorrectable/multi-bit error is not a "fatal" error and does not result in a server reboot when it occurs.
Recommendation: No action is necessary. The bad memory page or location has been provided to the OS "black list" and will not be used. This error message occurs during POST upon every reboot.
MEM9022 - Newer than BIOS 2.7.x (Post July 2020) - planned changes:
Actual message: A Non-Critical event was detected on the Non-Volatile Dual In-line memory module (NVDIMM) device in the slot arg1.
Persistent Memory scrub identified an uncorrectable error on a non-consumed memory page or location that is not in use. That memory page or location has been provided to the OS "black list" to not be used.
Recommendation: No action is necessary. The bad memory page or location has been provided to the OS "black list" and will not be used.
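The mapping between BIOS revision, message ID, and recommendation described above can be summarized in a short lookup sketch. It is a reading aid under stated assumptions, not a Dell tool: the function names are hypothetical, BIOS versions are assumed to be three-part strings such as "2.5.4", and "newer than BIOS 2.7.x" is assumed here to mean 2.7.0 and later. Versions between 2.6.4 and 2.7.0 are not covered by this article, so the sketch returns None for them.

```python
from typing import Optional, Tuple

# Recommendations as stated in this article for each message ID.
RECOMMENDATIONS = {
    "MEM0001": "Replace NVDIMM-N referenced.",
    "MEM0702": "Replace NVDIMM-N referenced.",
    "MEM9072": "No action is necessary.",
    "MEM9022": "No action is necessary.",
}


def parse_bios_version(version: str) -> Tuple[int, ...]:
    """Turn a three-part version string such as '2.5.4' into (2, 5, 4)."""
    return tuple(int(part) for part in version.split("."))


def expected_scrub_message(bios_version: str, page_consumed: bool) -> Optional[str]:
    """Return the message ID expected for a scrub-detected uncorrectable error."""
    if page_consumed:
        # Errors on consumed (in-use) pages are MEM0001 on all BIOS revisions.
        return "MEM0001"
    v = parse_bios_version(bios_version)
    if v < (2, 5, 4):
        return "MEM0702"
    if v <= (2, 6, 4):
        return "MEM9072"
    if v >= (2, 7, 0):
        # Assumption: "newer than BIOS 2.7.x" is read as 2.7.0 and later.
        return "MEM9022"
    # Revisions between 2.6.4 and 2.7.0 are not described in this article.
    return None
```

For example, expected_scrub_message("2.6.4", page_consumed=False) yields "MEM9072", for which RECOMMENDATIONS gives "No action is necessary."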
Key differences between standard RDIMM/LRDIMM and NVDIMM-N modules
Do not move NVDIMM-N modules between server types, for example from an AMD-based server to an Intel-based server.
The existing persistent data on the NVDIMM-N module may no longer be accessible.
There are differences in CRC algorithms on different system types that result in unexpected errors (MEM0001, MEM0702, MEM9072, or MEM9022).
When moving an NVDIMM-N module to a different system, sanitize it in the new system to ensure it functions as expected. Sanitizing an NVDIMM-N module erases any data on it.
Do not move NVDIMM-N modules from one slot location to another, for example for troubleshooting. NVDIMM-N modules are configured in the OS in either a stand-alone or interleaving configuration, on a per-slot basis. Physically relocating the NVDIMM-N modules may result in data loss, as the NVDIMM-N module in a given slot location no longer matches the current OS configuration.
If there is no valid data on the NVDIMM-N and NVDIMM-N modules must be moved to a different slot (that is, swapped for troubleshooting):
Be sure to sanitize (erase) the associated NVDIMM-N modules. If there are persistent memory scrubbing errors logged during POST (when providing an existing "bad" list to the OS) on a particular slot, those errors continue on that slot even if the affected NVDIMM-N module is swapped or moved to a different slot. The sanitize operation, in addition to erasing the data on the NVDIMM-N module, clears the "bad" list held in the MB NVRAM for the given slot.
Configure the NVDIMM-N modules as needed in the OS and restore the customer data.
NVDIMM-N modules contain firmware which can introduce behavior dependency issues when changing firmware versions. If an NVDIMM-N module's firmware is downgraded from its original version, it must be sanitized before use. Failing to do so will likely result in "false" errors (MEM0702, MEM9072, or MEM9022) being reported by the persistent memory scrubbing.
In a recent case, multiple servers had their NVDIMM-N firmware downgraded from version 9772 to 9324 without sanitizing the modules afterward. These servers reported MEM0702 errors (either during or shortly after POST) across many of the NVDIMM-N modules. Sanitizing (erasing) the NVDIMM-N modules resolved these "false" persistent memory scrubbing errors.
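The conditions in this section under which an NVDIMM-N module should be sanitized before use can be expressed as a simple check. This is a minimal sketch, not a Dell utility: the function name and parameters are hypothetical, and it assumes firmware versions can be compared as plain integers, which matches the 9772 to 9324 downgrade example but may not hold for every firmware numbering scheme.

```python
def sanitize_required(original_fw: int,
                      current_fw: int,
                      moved_to_different_system: bool = False,
                      moved_to_different_slot: bool = False) -> bool:
    """Return True if the module should be sanitized (erased) before use.

    Sanitizing erases all data on the module, so back up any valid data first.
    """
    # Assumption: a numerically lower version means the firmware was downgraded.
    firmware_downgraded = current_fw < original_fw
    return (firmware_downgraded
            or moved_to_different_system
            or moved_to_different_slot)
```

With the values from the case above, sanitize_required(9772, 9324) returns True, while a module that keeps its firmware and location does not need sanitizing.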
More information about NVDIMM-N memory can be found in the Dell EMC NVDIMM-N Persistent Memory User Guide that is available in the Manuals and Documents tab for the platform found at: https://www.dell.com/support/home