PowerFlex: SDS device reports errors although the device is still in use and healthy
Summary: The MDM reports device errors from an SDS, but the SDS or DAX device in question is being used by the cluster and is healthy.
Symptoms
The MDM reports SDS or DAX device errors based on S.M.A.R.T. attributes. The drive is not ejected until an I/O issue occurs.
MDM events.txt
SDS device example:
2018-06-18 14:16:10.290 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR Device failure state reported on SDS: SIO-NODE3, Device: /dev/sdu
DAX device example:
2021-06-06 21:11:25.765 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax1.0.
2021-06-06 21:11:25.784 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax0.0.
2021-06-06 21:11:25.786 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax3.0.
2021-06-06 21:11:25.786 SDS_DEV_MOVED_TO_FAILURE_STATE ERROR A device failure state exists on SDS: SIO-NODE3, Device: /dev/dax2.0.
The SDS ejects the drive only when it encounters an I/O issue. In that case, events similar to the following appear:
2018-06-19 01:28:38.662 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: SIO-NODE3, Device: /dev/sdb.
2018-06-19 01:28:38.962 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
Note the difference in error messages:
- SDS_DEV_MOVED_TO_FAILURE_STATE <<<--- device not failed
- SDS_DEV_ERROR_REPORT <<<--- device failed
Note: When a device fails, the system reports I/O discards or errors on that device and enters the DATA_DEGRADED state, which triggers a rebuild.
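The difference between the two event types can be checked with a small log filter. The following is a minimal Python sketch, not a supported tool. It assumes the event layout seen in the examples above (timestamp, event name, severity, message). The function name classify_events and the wording of the meanings are illustrative only.

```python
import re

# Assumed layout, taken from the example events above:
# "<date> <time> <EVENT_NAME> <SEVERITY> <message ... SDS: <name>, Device: <path>>"
EVENT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(SDS_DEV_MOVED_TO_FAILURE_STATE|SDS_DEV_ERROR_REPORT) "
    r"\w+ .*?SDS: ([^,]+), Device: (/dev/[\w.]*\w)"
)

# Interpretation of each event name, as described in this article.
MEANING = {
    "SDS_DEV_MOVED_TO_FAILURE_STATE": "S.M.A.R.T. warning, device not failed",
    "SDS_DEV_ERROR_REPORT": "I/O error, device failed",
}


def classify_events(text):
    """Return (timestamp, sds, device, meaning) for each device event in text."""
    return [
        (ts, sds, dev, MEANING[name])
        for ts, name, sds, dev in EVENT_RE.findall(text)
    ]


if __name__ == "__main__":
    # Hypothetical usage: pass the contents of the MDM events.txt file.
    with open("events.txt") as fh:
        for row in classify_events(fh.read()):
            print(*row, sep=" | ")
```

Only events classified as "device failed" are expected to coincide with an MDM_DATA_DEGRADED event and a rebuild. The others indicate a S.M.A.R.T. warning on a device that is still in use.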
Checking the hardware diagnostics of the host from iDRAC shows that the disk is reporting predictive failures.
The impact can range from cosmetic (filling up the MDM events) to a drive failure that causes a rebuild.
Cause
The LIA agent on the SDS node samples the S.M.A.R.T. attributes of the storage devices that are used by the SDS. It passes this data to the MDM, which reports any issues it sees but does not act on the data.
No action is taken because the S.M.A.R.T. status only provides two values: "threshold not exceeded" and "threshold exceeded." Often these are represented as "drive OK" or "drive fail" respectively.
The "threshold exceeded" value indicates a high probability that the drive will fail in the near future. The failure may be catastrophic or subtle, such as an inability to write to specific sectors or performance that is slower than the manufacturer claims.
Resolution
Run manual hardware diagnostics to determine if the SDS or DAX device in question must be replaced. Consult the hardware vendor as needed.
Impacted versions
ScaleIO 2.x.x
VxFlex OS 3.0.x
PowerFlex 3.5.x
PowerFlex 3.6.0.x-3.6.1.x
Fixed in version
LIA sampling design was improved in PowerFlex 3.6.0.3.
False positive MDM events were fixed in PowerFlex 3.6.2.