PowerFlex 3.X: SDSs Decouple After MDM Ownership Change
Summary: Multiple SDSs decouple after an MDM ownership change.
Symptoms
MDM ownership changed due to an admin-initiated activity.
MDM ownership changed due to a failure on one of the MDM servers.
MDM event logs showing that a new Primary MDM node has taken ownership:
2024-07-04 07:45:28.088000:0114714:MDM_BECOMING_MASTER WARNING This MDM is switching to Master mode. MDM will start running.
Multiple SDSs reconnect to the new Primary MDM and disconnect shortly afterward:
2024-07-04 07:45:41.218000:0115810:SDS_RECONNECTED INFO SDS: sds1 (ID 13f4fe8800000001) reconnected
2024-07-04 07:45:41.377000:0115811:SDS_RECONNECTED INFO SDS: sds2 (ID 13f4fe3b00000002) reconnected
2024-07-04 07:45:43.194000:0115990:SDS_DECOUPLED ERROR SDS: sds1 (id: 13f4fe8800000001) decoupled.
2024-07-04 07:45:44.197000:0116051:SDS_DECOUPLED ERROR SDS: sds2 (id: 13f4fe3b00000002) decoupled.
2024-07-04 07:45:45.192000:0115809:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
2024-07-04 07:45:45.786000:0116061:MDM_DATA_FAILED CRITICAL The system is now in DATA FAILURE state. Some data is unavailable.
In this case, the SDSs that decoupled eventually stayed connected to the MDM.
SDS trace logs showing that the SDS is blocking itself:
04/07 07:45:39.606135 0x7fc708919db8:kalive_IsBlocked:00570: Keep-Alive (KA) is blocked: TRUE
04/07 07:45:46.578166 0x7fc702567db8:kalive_ShouldSendKeepAlive:00345: KA aborted because SDS is blocked
The SDS process blocks itself if it believes it has a local issue. It does this to prevent I/O issues, and it then attempts to reconnect to the Primary MDM.
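The reconnect-then-decouple pattern in the MDM event log can be spotted with a short script. The following is a minimal sketch in Python, assuming the event lines have the same layout as the excerpts above. The function name find_flapping_sds and the 10-second window are illustrative assumptions, not part of PowerFlex or its tooling.

```python
import re
from datetime import datetime, timedelta

# Matches SDS_RECONNECTED / SDS_DECOUPLED lines in the layout shown in this article.
# The ID appears as "(ID <hex>)" on reconnect and "(id: <hex>)" on decouple.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6}):\d+:"
    r"(?P<event>SDS_RECONNECTED|SDS_DECOUPLED)\s+\w+\s+SDS:\s+(?P<name>\S+)\s+"
    r"\((?:ID|id:)\s+(?P<sds_id>[0-9a-fA-F]+)\)"
)


def find_flapping_sds(lines, window_seconds=10.0):
    """Return (name, sds_id, gap_seconds) for each SDS that decoupled
    within window_seconds of its most recent reconnect.

    window_seconds is an arbitrary threshold chosen for illustration.
    """
    window = timedelta(seconds=window_seconds)
    last_reconnect = {}  # sds_id -> timestamp of latest SDS_RECONNECTED
    flapping = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # ignore unrelated events
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        sds_id = match["sds_id"]
        if match["event"] == "SDS_RECONNECTED":
            last_reconnect[sds_id] = ts
        elif sds_id in last_reconnect:
            gap = ts - last_reconnect.pop(sds_id)
            if gap <= window:
                flapping.append((match["name"], sds_id, gap.total_seconds()))
    return flapping


if __name__ == "__main__":
    sample = [
        "2024-07-04 07:45:41.218000:0115810:SDS_RECONNECTED INFO SDS: sds1 (ID 13f4fe8800000001) reconnected",
        "2024-07-04 07:45:43.194000:0115990:SDS_DECOUPLED ERROR SDS: sds1 (id: 13f4fe8800000001) decoupled.",
    ]
    for name, sds_id, gap in find_flapping_sds(sample):
        print(f"{name} ({sds_id}) decoupled {gap:.3f}s after reconnecting")
```

Several SDSs reported by such a check right after an MDM_BECOMING_MASTER event would be consistent with the symptoms described here, but the SDS trace logs are still needed to confirm that the SDSs blocked themselves.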
Impact
One or more storage pools experience degraded capacity.
One or more storage pools experience failed capacity.
Cause
When a new MDM takes ownership of the cluster, all SDSs connect to the new Primary MDM. During this transition, the SDSs receive reconfiguration commands from the MDM. In rare cases, the SDSs might complete the MDM's reconfiguration instructions but then have to wait for further guidance. If the MDM does not provide additional instructions within 5 seconds, the SDSs mark themselves as blocked and attempt to reconnect to the MDM. This issue is more common in very large environments with 70 or more SDSs, where the MDM may not be fast enough to send the necessary instructions, causing the SDSs to disconnect and try again.
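The timing described above can be illustrated with a toy model in Python. This is only a sketch for reasoning about the 5-second limit, not PowerFlex code: the function names, the assumption that the MDM serves SDSs strictly one after another, and the per-SDS service time are all hypothetical. Only the 5-second wait limit comes from this article.

```python
WAIT_LIMIT_SECONDS = 5.0  # from this article: an SDS blocks itself after 5 seconds without instructions


def sds_blocks_itself(wait_seconds, wait_limit=WAIT_LIMIT_SECONDS):
    """Return True if an SDS that waited wait_seconds for further MDM
    instructions would mark itself as blocked."""
    return wait_seconds > wait_limit


def count_blocked(num_sds, seconds_per_sds, wait_limit=WAIT_LIMIT_SECONDS):
    """Count SDSs that would block themselves in a toy model.

    Hypothetical assumption: the MDM sends follow-up instructions to the
    SDSs one at a time, spending seconds_per_sds on each, so the k-th SDS
    served (1-indexed) waits k * seconds_per_sds.
    """
    return sum(
        1
        for k in range(1, num_sds + 1)
        if sds_blocks_itself(k * seconds_per_sds, wait_limit)
    )


if __name__ == "__main__":
    # Hypothetical per-SDS service time; real values are not given in this article.
    for num_sds in (10, 70):
        print(num_sds, "SDSs:", count_blocked(num_sds, seconds_per_sds=0.08), "blocked")
```

Under these assumptions a small cluster never reaches the limit, while in a larger cluster the SDSs served last exceed it. That matches the article's observation that the issue is more common with 70 or more SDSs.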
Resolution
To prevent this issue from occurring, upgrade the PowerFlex software to a version that includes the fix.
Impacted Version
PowerFlex 3.6 and older
Fixed In Version
PowerFlex 3.6.1 and newer