PowerFlex: SDS device reports errors although the device is still in use and healthy

Summary: The MDM reports device errors from an SDS, but the SDS or DAX device in question is being used by the cluster and is healthy.


Symptoms

The MDM reports SDS or DAX device errors based on S.M.A.R.T. attributes. The drive is not ejected until it encounters an I/O issue.

MDM events.txt

SDS device example:

2018-06-18 14:16:10.290 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR Device failure state reported on SDS: SIO-NODE3, Device: /dev/sdu

DAX device example:

2021-06-06 21:11:25.765 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax1.0.
2021-06-06 21:11:25.784 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax0.0.
2021-06-06 21:11:25.786 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax3.0.
2021-06-06 21:11:25.786 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax2.0.

The SDS ejects the drive only when it encounters an I/O issue. In that case, the events look like this:

2018-06-19 01:28:38.662 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: SIO-NODE3, Device: /dev/sdb.
2018-06-19 01:28:38.962 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.

Note the difference in error messages:

 - SDS_DEV_MOVED_TO_FAILURE_STATE  <<<--- device not failed
 - SDS_DEV_ERROR_REPORT            <<<--- device failed 

When SDS_DEV_ERROR_REPORT is logged, the system shows I/O discards or errors on that particular device and moves to the DEGRADED state (MDM_DATA_DEGRADED), which triggers a rebuild.
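The distinction between the two events can be checked mechanically. The following is a minimal Python sketch, not a supported tool: it assumes events.txt lines follow the layout of the excerpts above (timestamp, event name, severity, then "SDS: <name>, Device: <path>"), and the function names classify_device_events and summarize are hypothetical.

```python
import re
from collections import defaultdict

# Event names taken from the events.txt excerpts in this article.
SMART_WARNING = "SDS_DEV_MOVED_TO_FAILURE_STATE"  # reported, device not failed
IO_FAILURE = "SDS_DEV_ERROR_REPORT"               # device failed on I/O

# Assumed layout: "<date> <time> <EVENT> <SEVERITY> ... SDS: <name>, Device: <path>"
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<event>SDS_DEV_MOVED_TO_FAILURE_STATE|SDS_DEV_ERROR_REPORT)\s+"
    r"(?P<severity>\S+)\s+.*?SDS:\s*(?P<sds>[^,]+),\s*Device:\s*(?P<device>\S+)"
)


def classify_device_events(lines):
    """Map (sds, device) to the set of device event names seen for it."""
    seen = defaultdict(set)
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # other events, such as MDM_DATA_DEGRADED, are skipped
        # Some messages end the device path with a sentence period.
        device = match.group("device").rstrip(".")
        seen[(match.group("sds").strip(), device)].add(match.group("event"))
    return dict(seen)


def summarize(events_by_device):
    """Label each device by the most severe event recorded for it."""
    return {
        key: "failed on I/O" if IO_FAILURE in events else "S.M.A.R.T. warning only"
        for key, events in events_by_device.items()
    }
```

A device that appears only with SDS_DEV_MOVED_TO_FAILURE_STATE is still in use by the cluster. One that appears with SDS_DEV_ERROR_REPORT has been ejected.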

Hardware diagnostics for the host in iDRAC show that the disk is reporting predictive failures.

The impact can range from cosmetic (MDM events filling up) to a drive failure that causes a rebuild.

Cause

The LIA agent on the SDS node samples the S.M.A.R.T. attributes of the storage devices that are used by the SDS. It passes this data to the MDM, which reports any issues it sees but does not act on the data.

No action is taken because the S.M.A.R.T. status only provides two values: "threshold not exceeded" and "threshold exceeded." Often these are represented as "drive OK" or "drive fail" respectively.

The "threshold exceeded" value indicates a high probability that the drive will fail in the future, that is, the drive is about to fail. The failure may be catastrophic or subtle, such as an inability to write to specific sectors or performance slower than the manufacturer claims.

Resolution

Run manual hardware diagnostics to determine if the SDS or DAX device in question must be replaced. Consult the hardware vendor as needed.
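This article does not name a diagnostic tool. As one possible illustration on a Linux SDS node, the sketch below reads the overall S.M.A.R.T. health verdict with smartctl from smartmontools. It assumes smartmontools 7.0 or later, which supports JSON output with -j. It targets block devices such as /dev/sdu, and DAX devices would likely need a vendor-specific tool. The function names are hypothetical.

```python
import json
import subprocess


def parse_smart_health(smartctl_json):
    """Translate smartctl JSON output into the two S.M.A.R.T. status values."""
    try:
        data = json.loads(smartctl_json)
    except ValueError:
        return "unknown"  # empty or non-JSON output
    status = data.get("smart_status") if isinstance(data, dict) else None
    if not isinstance(status, dict) or "passed" not in status:
        return "unknown"  # device did not report an overall health status
    return "threshold not exceeded" if status["passed"] else "threshold exceeded"


def check_device(device):
    """Run 'smartctl -H -j <device>' and return the parsed verdict."""
    # smartctl uses a non-zero exit status as a bitmask for problems it
    # found, so the return code is not treated as a command failure here.
    proc = subprocess.run(
        ["smartctl", "-H", "-j", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_smart_health(proc.stdout)
```

For example, check_device("/dev/sdu") returns "threshold exceeded" when the drive reports a failing health status. Such a result suggests consulting the hardware vendor about replacement.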


Impacted versions

ScaleIO 2.x.x

VxFlex OS 3.0.x

PowerFlex 3.5.x

PowerFlex 3.6.0.x-3.6.1.x


Fixed in version

LIA sampling design was improved in PowerFlex 3.6.0.3.

False positive MDM events were fixed in PowerFlex 3.6.2.
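The version information above can be expressed as a small check. The following Python sketch is illustrative only: it assumes purely numeric dotted version strings, treats only the releases listed under "Impacted versions" as impacted, and assumes that releases later than a fix version carry the fix forward. The function name version_status is hypothetical.

```python
def version_status(version):
    """Classify a dotted version string against the lists in this article."""
    parts = tuple(int(p) for p in version.split("."))
    padded = parts + (0,) * (4 - len(parts))  # "3.6.2" becomes (3, 6, 2, 0)
    major, minor, patch = padded[0], padded[1], padded[2]

    impacted = (
        major == 2                                        # ScaleIO 2.x.x
        or (major, minor) in {(3, 0), (3, 5)}             # VxFlex OS 3.0.x, PowerFlex 3.5.x
        or ((major, minor) == (3, 6) and patch in (0, 1)) # PowerFlex 3.6.0.x-3.6.1.x
    )
    return {
        "impacted": impacted,
        # LIA sampling design improved in PowerFlex 3.6.0.3
        "lia_sampling_improved": padded >= (3, 6, 0, 3),
        # False positive MDM events fixed in PowerFlex 3.6.2
        "false_positives_fixed": padded >= (3, 6, 2, 0),
    }
```

For example, version_status("3.6.0.5") reports an impacted release that already has the improved LIA sampling but not the fix for false positive events.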

Affected Products

PowerFlex rack, PowerFlex custom node

Products

VxFlex Ready Nodes, PowerFlex appliance R650, PowerFlex appliance R6525, PowerFlex appliance R660, PowerFlex appliance R6625, PowerFlex appliance R750, PowerFlex appliance R760, PowerFlex appliance R7625, PowerFlex appliance R640, PowerFlex appliance R740XD, PowerFlex appliance R7525, PowerFlex appliance R840 ...
Article Properties
Article Number: 000049265
Article Type: Solution
Last Modified: 02 Jan 2025
Version:  6