PowerFlex 3.X Data Unavailable From A Single SDS Decouple Event
Summary: A PowerFlex system enters a DATA FAILURE state after a single SDS decouple event.
Symptoms
- The PowerFlex system was in a normal, healthy state before the events below.
- The MDM event logs show a single SDS decoupling, and the system enters a DATA FAILURE state almost immediately. The system stays in this state even after the SDS reconnects.
2023-12-18 14:39:48.489000:1047016:SDS_DECOUPLED ERROR SDS: sds93 (id: f1f8bfde00000023) decoupled.
2023-12-18 14:39:49.403000:1047017:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
2023-12-18 14:39:50.406000:1047018:MDM_DATA_FAILED CRITICAL The system is now in DATA FAILURE state. Some data is unavailable.
2023-12-18 14:40:06.143000:1047036:SDS_RECONNECTED INFO SDS: sds93 (ID f1f8bfde00000023) reconnected.
- After the system enters the DATA FAILURE state, one or more disks on a different SDS show an error state in the Presentation Server UI and in the scli output.
- The DATA FAILURE state can be exited by clearing the device errors.
- The device(s) that went into an error state had previously been placed in a WARNING state, as recorded in the MDM event logs.
2023-12-14 11:58:07.680000:0955611:SDS_DEV_WARNING WARNING A device warning threshold has been reached on SDS: sds93, Device: /dev/sdj. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
- If an MDM switchover occurs, the device enters an ERROR state.
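The decouple-then-failure pattern described above can be searched for in an exported copy of the MDM event log. The following is a minimal sketch, not a Dell tool. It assumes the event lines follow the layout seen in the excerpts in this article (timestamp, event number, event name, severity, message). The helper names and the 5-second window are hypothetical choices made for illustration.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the excerpts in this article:
# "<timestamp>:<event number>:<EVENT_NAME> <SEVERITY> <message>"
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<seq>\d+):(?P<name>[A-Z_]+) (?P<severity>[A-Z]+) (?P<msg>.*)$"
)


def parse_events(lines):
    """Yield (timestamp, event_name, message) for each line that matches."""
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
            yield ts, m["name"], m["msg"]


def failures_after_decouple(lines, window_seconds=5.0):
    """Return (decouple_time, failure_time) pairs where MDM_DATA_FAILED
    follows the most recent SDS_DECOUPLED within window_seconds."""
    hits = []
    last_decouple = None
    for ts, name, _msg in parse_events(lines):
        if name == "SDS_DECOUPLED":
            last_decouple = ts
        elif name == "MDM_DATA_FAILED" and last_decouple is not None:
            if (ts - last_decouple).total_seconds() <= window_seconds:
                hits.append((last_decouple, ts))
    return hits


# The four event lines quoted in the Symptoms section.
sample = [
    "2023-12-18 14:39:48.489000:1047016:SDS_DECOUPLED ERROR SDS: sds93 (id: f1f8bfde00000023) decoupled.",
    "2023-12-18 14:39:49.403000:1047017:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.",
    "2023-12-18 14:39:50.406000:1047018:MDM_DATA_FAILED CRITICAL The system is now in DATA FAILURE state. Some data is unavailable.",
    "2023-12-18 14:40:06.143000:1047036:SDS_RECONNECTED INFO SDS: sds93 (ID f1f8bfde00000023) reconnected.",
]

hits = failures_after_decouple(sample)
for decoupled_at, failed_at in hits:
    print(f"DATA FAILURE at {failed_at} followed a decouple at {decoupled_at}")
```

A match only indicates that the log shows the same sequence of events; confirming this issue still requires checking for an earlier SDS_DEV_WARNING event on the affected device.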
Impact
A portion of the data is unavailable until the device errors are cleared.
Clients may see I/O errors.
An MDM switchover causes the device to enter an ERROR state and may result in a Data Unavailable (DU) event if this occurs on multiple hosts simultaneously.
Cause
The MDM did not handle the WARNING state of the device correctly and technically counted it as the first failure, even though the disk was still in use and physically fine. When a second SDS then failed, the MDM counted this as the second failure, which caused the DATA FAILURE state.
Resolution
If already impacted by this series of events:
- Clear the device errors on the devices that now show an error state.
If not yet impacted by a DATA FAILURE state:
- Locate the SDS and device that were placed in a WARNING state in the MDM event logs.
- Preemptively clear the device error from scli.
- Clearing a device error on a device that shows no errors is not harmful; it only clears flags that are not needed.
scli --clear_sds_device_error --sds_name <SDS_NAME> --device_path <PATH>
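Where many devices have logged warnings, the lookup in the steps above can be scripted. The sketch below is illustrative only. It assumes SDS_DEV_WARNING messages carry "SDS: <name>, Device: <path>." as in the excerpt in this article, and it only prints candidate commands in the form shown above for review; it does not run scli. The function name is hypothetical.

```python
import re
import shlex

# Assumed message layout, inferred from the SDS_DEV_WARNING excerpt in this article.
WARNING_RE = re.compile(
    r":SDS_DEV_WARNING \S+ .*?SDS: (?P<sds>[^,\s]+), Device: (?P<dev>\S+)\.(?:\s|$)"
)


def clear_commands(lines):
    """Return one scli command string per unique (SDS, device) pair
    that appears in an SDS_DEV_WARNING event, in first-seen order."""
    seen = set()
    commands = []
    for line in lines:
        m = WARNING_RE.search(line)
        if not m:
            continue
        key = (m["sds"], m["dev"])
        if key in seen:
            continue
        seen.add(key)
        argv = [
            "scli", "--clear_sds_device_error",
            "--sds_name", m["sds"],
            "--device_path", m["dev"],
        ]
        # Quote each argument so the printed line is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(a) for a in argv))
    return commands


# The warning event quoted in the Symptoms section, listed twice to show de-duplication.
sample = [
    "2023-12-14 11:58:07.680000:0955611:SDS_DEV_WARNING WARNING A device warning threshold has been reached on SDS: sds93, Device: /dev/sdj. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS",
    "2023-12-14 11:58:07.680000:0955611:SDS_DEV_WARNING WARNING A device warning threshold has been reached on SDS: sds93, Device: /dev/sdj. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS",
]

commands = clear_commands(sample)
for cmd in commands:
    print(cmd)
```

Printing the commands rather than executing them keeps an administrator in the loop, so each SDS name and device path can be checked against the system before anything is cleared.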
Impacted Versions
PowerFlex 3.6.x
Fixed In Version
PowerFlex 3.6.4