PowerFlex: SDS panic with Command took too long (TGT_MSG_TYPE__ADD_DEV) while Clear Device Error attempt
Summary: Software Defined Storage (SDS) process panic while Clear Device Error attempt runs under certain underlying disk conditions.
Symptoms
Disk observed hardware errors indicate slow disk response.
OS messages:
Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: attempting task abort! scmd(ffffxxxxxa010540) Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: [sde] CDB: Write(10) 2a 00 24 48 36 60 00 00 08 00 Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: handle(0x000d), sas_address(0x58cexxxxx1139e9a), phy(3) Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: enclosure logical id(0x50005xxxxx34abff), slot(3) Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: enclosure level(0x0001), connector name( ) : Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: task abort: SUCCESS scmd(ffffxxxxxa010540) Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] CDB: Write(10) 2a 00 1b 6f eb 20 00 00 18 00 Mar 14 23:13:04 esx04-svm kernel: blk_update_request: I/O error, dev sde, sector 460319520 : Mar 14 23:13:04 esx04-svm systemd: sds.service: main process exited, code=exited, status=254/n/a Mar 14 23:13:04 esx04-svm systemd: Unit sds.service entered failed state. Mar 14 23:13:04 esx04-svm systemd: sds.service failed. Mar 14 23:13:04 esx04-svm systemd: sds.service has no holdoff time, scheduling restart. Mar 14 23:13:04 esx04-svm systemd: Stopped scaleio sds. Mar 14 23:13:04 esx04-svm systemd: Started scaleio sds.
SDS trace log:
14/03 23:10:04.535824 0x7fadbde2bdb8:contDevMngr_HandleLongInflightIoViolation:05927: IO on devId: fb775d1000030002 (/dev/sde) took too long, High threshold exceeded - waited for reaper 45830 millis : 14/03 23:13:11.395813 0x7f4832c55db8:mosAsyncIO_DoIOTimedSyncIntern:01564: Time budget exceeded during IO to /dev/sde (Give-up on inflight IO) 14/03 23:13:11.900810 0x7f4832c55db8:mosAsyncIO_DoIOTimedSyncIntern:01564: Time budget exceeded during IO to /dev/sde (Give-up on inflight IO) 14/03 23:13:11.900846 0x7f4832c55db8:contDev_SendDeviceError:02801: Sending device error to MDM: DevId:fb775dxxxxx30002 deviceName: /dev/sde readError: FALSE WriteError: TRUE 14/03 23:13:11.901335 0x7f4832c55db8:contDev_SendDeviceError:02801: Sending device error to MDM: DevId:fb775dxxxxx30002 deviceName: /dev/sde readError: TRUE WriteError: TRUE
The SDS marked it as in an error state.
Metadata Manager (MDM) events:
2021-03-14 23:13:12.031000:0003180:SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: SDS_x.x.x.x, Device: N/A-0002. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
Later, a user attempted the Clear Device Error operation to add it back to the system:
2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A.
When the user attempted the Clear Device Error operation the SDS process panicked and disconnected from MDM. As a result, the Clear Device Error command could not be completed internally, and the disk could not be added back to the SDS.
The SDS process panicked due to reaching the internal timeout while handling the internal Add Dev command as part of the Clear Device Error operation.
SDS exp.0:
16/03 22:25:06.340250 [CHOKE_POINT] Panic in file /data/builds/workspace/ScaleIO-Common-Job@2/src/tgt/cont/cont_cmd.c, line 11459, function contCmd_TimerExpiredInner, PID 29524.Panic Expression ALWAYS_ASSERT Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV) - Restarting. /opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosDbg_PanicPrepare+0x135) [0x8dd885] /opt/emc/scaleio/sds/bin/sds-3.0.1000.208(contCmd_TimerExpiredInner+0x6e) [0x80c52e] /opt/emc/scaleio/sds/bin/sds-3.0.1000.208() [0x73e16b] /opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimerQ_PollUnlocked+0xb2) [0x81f372] /opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimer_PollQRange+0xf8) [0x821088] /opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimer_Loop+0x28) [0x8210e8] /opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosOsThrd_StartFunc+0x118) [0x67e798] /lib64/libpthread.so.0(+0x7dd5) [0x7f17301f8dd5] 16/03 22:25:07.339909 Termination due to SIGALRM. PID 29524
SDS trace log:
16/03 22:24:06.338880 0x7f16e6cefdb8:contDev_Add:02034: Add Dev, path found in /dev/sde 16/03 22:24:16.344801 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxddb8, context=2277 16/03 22:24:22.344795 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxx3db8, context=2277 16/03 22:25:06.339793 (nil):contCmd_TimerExpiredInner:11454: Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxdb8, context=2277
OS messages:
Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: handle(0x000d), sas_address(0x58ce38ee21139e9a), phy(3) Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: enclosure logical id(0x5000xxxxx234abff), slot(3) Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: enclosure level(0x0001), connector name( ) Mar 16 22:25:05 esx04-svm kernel: sd 6:0:3:0: task abort: SUCCESS scmd(ffffxxxxx9e9a140) Mar 16 22:25:07 esx04-svm systemd-udevd: worker [2668] /devices/pci0000:00/0000:00:18.1/0000:1c:00.0/host6/port-6:0/expander-6:0/port-6:0:3/end_device-6:0:3/target6:0:3/6:0:3:0/block/sde is taking a long time
The system is in a DEGRADED state, and a rebuild is triggered:
2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A. 2021-03-16 22:24:06.328000:0003259:CLI_COMMAND_SUCCEEDED INFO Command clear_sds_device_error succeeded. [60856484] : 2021-03-16 22:24:16.856000:0003262:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled. 2021-03-16 22:24:23.862000:0003263:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled. 2021-03-16 22:24:23.863000:0003264:SDS_IN_COOL_DOWN WARNING SDS: SDS_x.x.x.x (ID yyyyyyyy00000003) will disconnect from MDM for 15 seconds. failed to reconnect multiple times 2021-03-16 22:24:24.108000:0003265:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
Cause
The defective disk was unstable with slow responses. Therefore, the SDS was unable to add it back to the system in a timely manner and panicked to prevent the process hang up.
This symptom is possibly observed under certain defective disk conditions. PowerFlex and SDS work as designed.
Resolution
Collaborate with the hardware support team to replace the defective disk and hardware to address the issue.