PowerFlex 4.X: SMART_AGGREGATED_STATE_FAILED - DAX Device - SIO03.02.0000013
Summary: PowerFlex reports that one of the dax devices is about to fail and should be replaced. NVDIMM correctable memory errors can cause a dax device to show as "Failed Now" in PowerFlex when the device is not truly failed. ...
Symptoms
This alert is usually triggered by a standard PowerFlex device, in which case the recommended action is to replace the disk. This article covers the extra troubleshooting steps that apply when the device in question is a DAX device, which is an NVDIMM rather than a disk.
PowerFlex reports the alert below, which states that one of the DAX devices is about to fail or is operating at reduced performance. The same alert is also sent as a DialHome alert:
SIO03.02.0000013
The disk may be about to fail, or may be operating with reduced performance.
SMART_AGGREGATED_STATE_FAILED_NOW
SDS.Device.SMART_Aggregated_State_Failed_Now
Recommended Action: Consider replacing the disk.

Cause
PowerFlex detects that a correctable or uncorrectable error has occurred on one of the NVDIMMs in a storage node. Once an error is detected, PowerFlex raises the SMART Aggregated State Failed Now alert.
Resolution
Sanitizing the NVDIMM can resolve the problem without replacing it. Replace the NVDIMM only if the problem persists after sanitization.
- For 15G nodes, you can only sanitize all NVDIMMs, not an individual NVDIMM. Remove all devices (SDS devices and DAX devices) before sanitizing to avoid data corruption.
- For 14G nodes, you can sanitize either only the NVDIMM in question or all NVDIMMs. Follow the steps below to identify the failing NVDIMM.
Validating the Failure
Step 1: In PowerFlex Manager, go to Block > Devices
Step 2: Select the columns box at the upper right and add "S.M.A.R.T State" to the list.

Step 3: Review the list of DAX devices and confirm whether any of them show "Failed Now".

Step 4: For each DAX device in a "Failed Now" state, determine which SDS the device belongs to and log in to the iDRAC of that node.
Step 5: Check the Lifecycle Logs for correctable errors reported against each of the NVDIMMs.
- Correctable memory errors on NVDIMMs can cause a DAX device to show "Failed Now" in PowerFlex even though the device has not actually failed.

Step 6: SSH into the Primary MDM for the corresponding cluster and run the following commands. They validate the DAX device health, the slot information, and the serial number (SN) of the NVDIMM that shows as failed:
- Log in to the Primary MDM with the admin account. Use single quotes ' ' around the password if it contains special characters:
scli --login --username admin --password '<password>' --management_system_ip <pfmp_ip>
- Run the following command to list the details of all SDSs:
scli --query_all_sds
Example:
SDS ID: 2e400d3800000004 Name: LAB10_SDS3 State: Connected, Joined IP: XXX
SDS ID: 2e400d3700000003 Name: LAB10_SDS2 State: Connected, Joined IP: XXX
SDS ID: 2e400d3600000002 Name: LAB10_SDS4 State: Connected, Joined IP: XXX
SDS ID: 2e400d3500000001 Name: LAB10_SDS1 State: Connected, Joined IP: XXX
SDS ID: 2e400d3400000000 Name: LAB10_SDS5 State: Connected, Joined IP: XXX
- Locate the impacted SDS, note down its name, then run the following command:
scli --query_sds --sds_name <sds name>
- Note down the device path of every device on this SDS. These paths are used later to add the devices back in Step 8.
- Note down the ID of the impacted device, then run the following command:
scli --query_sds_device_info --device_id <device_id>
Example output:
# scli --query_sds_device_info --device_id XXXXX
Device ID: _________ Name: /dev/dax1.0 Path: /dev/dax1.0
ScaleIO Device Configuration:
Original Path: /dev/dax1.0
Acceleration Pool: _________
Used for RFcache: no
Capacity Limit: 31.4 GB (32107 MB)
Device State: Normal
Physical Device Information:
Device Type: UNKNOWN
Media Type: NVDIMM
Auto Detected Media Type: UNKNOWN
Vendor Name: N/A
Model Name: nmem2
Serial Number: _________
Slot Number: B5
Firmware Version: N/A
Cache Look-ahead: not Active
Write Cache: not Active
ATA Security: not Active
Logical Sector Size: 0 B
Physical Sector Size: 0 B
Capacity: 0 GB
LED Setting: OFF
Background Device Scanner Information:
Scanned: 0 MB
Error Fixes: 0
Compare Errors: 0
SMART Information:
Aggregated State: FAILED_NOW
Temperature State: NEVER_FAILED
Current Value: 34 Worst Value: 34 Threshold: 0
Media Wearout Indicator State: NEVER_FAILED
Current Value: 95 Worst Value: 95 Threshold: 5
RAID Controller Information:
Serial Number: N/A
RAID vDisk Status: N/A
RAID vDisk Type: N/A
RAID vDisk Cache: N/A
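When several DAX devices have to be checked, the fields that matter here (path, slot number, model name, serial number, and aggregated SMART state) can be pulled out of the text output with a small script. The following Python sketch is illustrative only and is not part of any PowerFlex tooling. The function names parse_device_info and needs_attention are hypothetical, and the script assumes the output layout matches the example above, with one "Label: value" pair per line.

```python
import re

# Sample text copied from the example output above, trimmed to the relevant lines.
SAMPLE = """\
Device ID: _________ Name: /dev/dax1.0 Path: /dev/dax1.0
Physical Device Information:
Media Type: NVDIMM
Model Name: nmem2
Serial Number: _________
Slot Number: B5
SMART Information:
Aggregated State: FAILED_NOW
RAID Controller Information:
Serial Number: N/A
"""

# Labels of interest. Only the first occurrence of each label is kept, so the
# RAID controller "Serial Number" does not overwrite the NVDIMM serial number.
FIELDS = ("Media Type", "Model Name", "Serial Number", "Slot Number", "Aggregated State")


def parse_device_info(text):
    """Extract selected fields from saved 'scli --query_sds_device_info' output."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Device ID:"):
            # The header line carries several values; only the path is needed.
            match = re.search(r"Path:\s*(\S+)", line)
            if match:
                info.setdefault("Path", match.group(1))
            continue
        label, sep, value = line.partition(":")
        if sep and label in FIELDS:
            info.setdefault(label, value.strip())
    return info


def needs_attention(info):
    """True when an NVDIMM-backed device reports FAILED_NOW."""
    return info.get("Media Type") == "NVDIMM" and info.get("Aggregated State") == "FAILED_NOW"


if __name__ == "__main__":
    device = parse_device_info(SAMPLE)
    if needs_attention(device):
        print("Check NVDIMM in slot", device["Slot Number"], "for device", device["Path"])
```

To use it against a real node, save the command output to a file and pass the file contents to parse_device_info in place of SAMPLE. The slot number and model name returned are the values to match against the NVDIMM entries in the iDRAC.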
Step 7: After identifying which DAX device corresponds to the noted NVDIMM, follow the standard procedure to sanitize the NVDIMMs through the iDRAC if the errors are correctable. If the errors are uncorrectable, follow the standard NVDIMM troubleshooting guidelines instead.
- Removing the devices from the SDS before sanitizing is mandatory to avoid data corruption. Follow the steps outlined in the article referenced below.
Step 8: After the NVDIMMs have been sanitized, add the DAX devices back to the SDS, then add the PowerFlex devices back using the previously noted device paths. The DAX devices then show as "Never Failed", and the corresponding alert is cleared from the Alerts tab automatically.
For details, see steps 9 and 10 in the article PowerFlex: How to Replace NVDIMM in 15G and later PowerEdge Node.
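Once the devices are back, it can help to compare the device paths noted in Step 6 with the paths the SDS reports now, and to confirm that no device still reports a failed SMART state. The following Python sketch shows one way to do that comparison. It is an illustration under stated assumptions: the helper names missing_paths and still_failed are hypothetical, the path /dev/sdb is a made-up example, and the state string NEVER_FAILED is assumed to be the CLI form of the "Never Failed" value shown in PowerFlex Manager.

```python
def missing_paths(noted_paths, current_paths):
    """Return the previously noted device paths that the SDS no longer reports."""
    return sorted(set(noted_paths) - set(current_paths))


def still_failed(smart_states):
    """Return the device paths whose aggregated SMART state is still FAILED_NOW."""
    return sorted(path for path, state in smart_states.items() if state == "FAILED_NOW")


if __name__ == "__main__":
    # Paths written down in Step 6 before the devices were removed (example values).
    noted = ["/dev/dax1.0", "/dev/sdb"]
    # Paths and aggregated SMART states read from the SDS after re-adding (example values).
    current = {"/dev/dax1.0": "NEVER_FAILED", "/dev/sdb": "NEVER_FAILED"}

    print("Missing devices:", missing_paths(noted, current))
    print("Still failed:", still_failed(current))
```

Both lists should be empty before the node is considered healthy. A non-empty "Still failed" list after sanitization matches the condition under which the Resolution section recommends replacing the NVDIMM.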
Additional Information
How to Sanitize NVDIMM: PowerFlex: How to Sanitize NVDIMM and SDPM - Video
DialHome Alert Recommendations: PowerFlex Top Dial Home Recommended Actions