PowerFlex IO Errors Received After SDC Failed To Reconnect With A Decoupled SDS
Summary: A client host reports I/O errors after a disconnected SDS reconnects to the system.
Symptoms
MDM events show that the SDS rejoins the cluster:
2022-05-17 03:17:00.127 SDS_RECONNECTED INFO SDS: () reconnected.
ESXi logs show recurring "failed to send for 180 iterations" events that eventually lead to hard I/O errors against the volume:
2022-05-17T04:33:11.925Z cpu53:2098393)WARNING: PowerFlex netCon_IsKaNeeded:4338 :Error: CON 0x45b9517e4d40 failed to send for 180 iterations. Marking as down
2022-05-17T04:33:11.925Z cpu53:2098393)WARNING: PowerFlex netCon_IsKaNeeded:4338 :Error: CON 0x45b9517e3f40 failed to send for 180 iterations. Marking as down
..
2022-05-17T04:33:21.275Z cpu36:2098514)ScsiDeviceIO: 4267: Cmd(0x45d96d40ab00) 0x28, CmdSN 0x800e0030 from world 3589861 to dev "" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0
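To check a host for this symptom, the warning lines quoted above can be matched mechanically. The following is a minimal sketch, not a supported tool: the function name find_down_connections is hypothetical, the pattern is derived only from the sample lines in this article, and it assumes the messages are written to the ESXi vmkernel log (typically /var/log/vmkernel.log).

```python
import re

# Pattern built from the sample warning lines in this article (assumption:
# other SDC builds may word the message differently).
KA_FAIL = re.compile(
    r"(?P<ts>\S+)\s+cpu\d+:\d+\)WARNING: PowerFlex netCon_IsKaNeeded:\d+ :Error: "
    r"CON (?P<con>0x[0-9a-fA-F]+) failed to send for (?P<n>\d+) iterations\. "
    r"Marking as down"
)


def find_down_connections(lines):
    """Return (timestamp, connection id, iteration count) for each matching line."""
    hits = []
    for line in lines:
        match = KA_FAIL.search(line)
        if match:
            hits.append((match.group("ts"), match.group("con"), int(match.group("n"))))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: the log path is an assumption, adjust as needed.
    with open("/var/log/vmkernel.log", errors="replace") as log:
        for ts, con, count in find_down_connections(log):
            print(ts, con, count)
```

Several distinct connection IDs marked as down after an SDS_RECONNECTED event would be consistent with the behavior described in this article, but the match alone does not prove it.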
Impact
SDCs may receive I/O errors. SDC datastores may go into a read-only state.
Cause
When an SDC-SDS TCP connection is terminated abruptly rather than in an orderly fashion (no TCP FIN packets are sent), for example after a switch failure or a power outage or hard reset of an SDS host, the sockets may remain open on the SDC side. Because of a software defect introduced in version 3.5, these sockets can remain stuck in the "security handshake" phase.
In this situation, the stuck connections are treated as "not connected", and they are not renewed even after the SDS host becomes reachable again.
Due to platform-specific implementations of the SDC network layer, this issue cannot occur in Linux SDCs.
Resolution
Workaround
The "SDS Proxy" feature that was introduced in version 3.6 greatly reduces the probability of the issue occurring.
An affected SDC host can be rebooted.
Impacted Versions
PowerFlex SDC version 3.5 and higher
Fixed In Version
PowerFlex 3.5.1.7
PowerFlex 3.6.0.5
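The version information above can be turned into a quick check. This is a minimal sketch under stated assumptions: the function names are hypothetical, the version is assumed to be a plain dotted numeric string such as 3.5.1.4, and releases outside the 3.5 and 3.6 branches are reported as "unknown" because this article does not list a fixed version for them.

```python
# Fixed versions per release branch, taken from this article.
FIXED_IN = {(3, 5): (3, 5, 1, 7), (3, 6): (3, 6, 0, 5)}


def parse_version(text):
    """Parse a dotted numeric version string into a 4-part tuple (assumed format)."""
    parts = tuple(int(p) for p in text.strip().split("."))
    return parts + (0,) * (4 - len(parts))  # pad short versions with zeros


def fix_status(version_text):
    """Classify an SDC version as 'not affected', 'affected', 'fixed', or 'unknown'."""
    version = parse_version(version_text)
    if version[:2] < (3, 5):
        return "not affected"  # the defect was introduced in version 3.5
    fixed = FIXED_IN.get(version[:2])
    if fixed is None:
        return "unknown"  # branch not covered by this article
    return "fixed" if version >= fixed else "affected"


if __name__ == "__main__":
    for candidate in ("3.0.1", "3.5.1.4", "3.5.1.7", "3.6.0.2", "3.6.0.5"):
        print(candidate, fix_status(candidate))
```

Confirm the result against the release notes for the installed build before acting on it.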