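PowerFlex: SDS Crashes With "Command took too long (TGT_MSG_TYPE__ADD_DEV)" When Attempting Clear Device Error
Summary: The Software Defined Storage (SDS) process crashes when a Clear Device Error operation is attempted under certain conditions of the underlying disk.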
Symptoms
Hardware errors observed for the disk indicate that it is responding slowly.
Operating system messages:
Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: attempting task abort! scmd(ffffxxxxxa010540)
Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: [sde] CDB: Write(10) 2a 00 24 48 36 60 00 00 08 00
Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: handle(0x000d), sas_address(0x58cexxxxx1139e9a), phy(3)
Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: enclosure logical id(0x50005xxxxx34abff), slot(3)
Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: enclosure level(0x0001), connector name( )
:
Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: task abort: SUCCESS scmd(ffffxxxxxa010540)
Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] CDB: Write(10) 2a 00 1b 6f eb 20 00 00 18 00
Mar 14 23:13:04 esx04-svm kernel: blk_update_request: I/O error, dev sde, sector 460319520
:
Mar 14 23:13:04 esx04-svm systemd: sds.service: main process exited, code=exited, status=254/n/a
Mar 14 23:13:04 esx04-svm systemd: Unit sds.service entered failed state.
Mar 14 23:13:04 esx04-svm systemd: sds.service failed.
Mar 14 23:13:04 esx04-svm systemd: sds.service has no holdoff time, scheduling restart.
Mar 14 23:13:04 esx04-svm systemd: Stopped scaleio sds.
Mar 14 23:13:04 esx04-svm systemd: Started scaleio sds.
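When a system log is long, it can help to count how often each kind of disk-related kernel message appears per device. The following is a minimal Python sketch, not a PowerFlex tool: the function name summarize_disk_errors and the three regular expressions are assumptions modeled only on the kernel messages quoted above, and other kernels or drivers may word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the kernel messages quoted above (assumption:
# other kernel versions or HBA drivers may use different wording).
PATTERNS = {
    "task_abort": re.compile(r"kernel: sd (?P<id>[\d:]+): attempting task abort!"),
    "failed_result": re.compile(r"kernel: sd [\d:]+: \[(?P<id>\w+)\] FAILED Result:"),
    "io_error": re.compile(r"blk_update_request: I/O error, dev (?P<id>\w+),"),
}


def summarize_disk_errors(lines):
    """Count matching kernel messages per (kind, device identifier).

    The identifier is the SCSI address for task aborts and the block
    device name for failed commands and I/O errors.
    """
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group("id"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: attempting task abort! scmd(ffffxxxxxa010540)",
        "Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK",
        "Mar 14 23:13:04 esx04-svm kernel: blk_update_request: I/O error, dev sde, sector 460319520",
    ]
    for (kind, dev), n in sorted(summarize_disk_errors(sample).items()):
        print(f"{kind:14} {dev:10} {n}")
```

Repeated task aborts and I/O errors against the same device, as seen here for sde, are consistent with the slow-disk condition this article describes.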
SDS trace log:
14/03 23:10:04.535824 0x7fadbde2bdb8:contDevMngr_HandleLongInflightIoViolation:05927: IO on devId: fb775d1000030002 (/dev/sde) took too long, High threshold exceeded - waited for reaper 45830 millis
:
14/03 23:13:11.395813 0x7f4832c55db8:mosAsyncIO_DoIOTimedSyncIntern:01564: Time budget exceeded during IO to /dev/sde (Give-up on inflight IO)
14/03 23:13:11.900810 0x7f4832c55db8:mosAsyncIO_DoIOTimedSyncIntern:01564: Time budget exceeded during IO to /dev/sde (Give-up on inflight IO)
14/03 23:13:11.900846 0x7f4832c55db8:contDev_SendDeviceError:02801: Sending device error to MDM: DevId:fb775dxxxxx30002 deviceName: /dev/sde readError: FALSE WriteError: TRUE
14/03 23:13:11.901335 0x7f4832c55db8:contDev_SendDeviceError:02801: Sending device error to MDM: DevId:fb775dxxxxx30002 deviceName: /dev/sde readError: TRUE WriteError: TRUE
The SDS marks the device as being in an error state.
Metadata Manager (MDM) events:
2021-03-14 23:13:12.031000:0003180:SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: SDS_x.x.x.x, Device: N/A-0002. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
Later, the user attempts a Clear Device Error operation to add the device back to the system:
2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A.
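When the user attempts the Clear Device Error operation, the SDS process crashes and disconnects from the MDM. As a result, the Clear Device Error command cannot complete internally, and the disk cannot be added back to the SDS.
The SDS process crashes because it reaches an internal timeout while handling the internal SDS Add Dev command that is issued as part of the Clear Device Error operation.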
SDS exp.0:
16/03 22:25:06.340250 [CHOKE_POINT] Panic in file /data/builds/workspace/ScaleIO-Common-Job@2/src/tgt/cont/cont_cmd.c, line 11459, function contCmd_TimerExpiredInner, PID 29524.Panic Expression ALWAYS_ASSERT Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV) - Restarting.
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosDbg_PanicPrepare+0x135) [0x8dd885]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(contCmd_TimerExpiredInner+0x6e) [0x80c52e]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208() [0x73e16b]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimerQ_PollUnlocked+0xb2) [0x81f372]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimer_PollQRange+0xf8) [0x821088]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimer_Loop+0x28) [0x8210e8]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosOsThrd_StartFunc+0x118) [0x67e798]
/lib64/libpthread.so.0(+0x7dd5) [0x7f17301f8dd5]
16/03 22:25:07.339909 Termination due to SIGALRM. PID 29524
SDS trace log:
16/03 22:24:06.338880 0x7f16e6cefdb8:contDev_Add:02034: Add Dev, path found in /dev/sde
16/03 22:24:16.344801 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxddb8, context=2277
16/03 22:24:22.344795 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxx3db8, context=2277
16/03 22:25:06.339793 (nil):contCmd_TimerExpiredInner:11454: Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxdb8, context=2277
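The timestamps in this trace show how long the Add Dev command ran before the SDS gave up. The Python sketch below computes the offset of each warning from the contDev_Add line. It is illustrative only: the helper names parse_trace_time and add_dev_offsets are hypothetical, the matched strings are taken from the excerpt above, and a placeholder year is prepended because the trace timestamps carry none.

```python
from datetime import datetime

# The trace timestamps have no year, so a fixed placeholder year is
# prepended only to make them parseable (an assumption of this sketch).
PLACEHOLDER_YEAR = "2000"

# Message fragments copied from the trace excerpt above.
MARKERS = ("Long operation detected", "Command took too long")


def parse_trace_time(line):
    """Parse the leading 'DD/MM HH:MM:SS.ffffff' fields of a trace line."""
    day_month, clock = line.split()[:2]
    return datetime.strptime(
        f"{PLACEHOLDER_YEAR} {day_month} {clock}", "%Y %d/%m %H:%M:%S.%f"
    )


def add_dev_offsets(lines):
    """Return (marker, seconds since the first contDev_Add line) pairs."""
    start = None
    offsets = []
    for line in lines:
        if start is None:
            # Skip everything until the Add Dev command begins.
            if "contDev_Add" in line:
                start = parse_trace_time(line)
            continue
        for marker in MARKERS:
            if marker in line:
                elapsed = (parse_trace_time(line) - start).total_seconds()
                offsets.append((marker, elapsed))
    return offsets
```

Applied to the four lines above, the two "Long operation detected" warnings fall roughly 10 and 16 seconds after Add Dev starts, and "Command took too long" roughly 60 seconds after. That matches the internal timeout described in this article, although the configured timeout value itself is not stated here.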
Operating system messages:
Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: handle(0x000d), sas_address(0x58ce38ee21139e9a), phy(3)
Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: enclosure logical id(0x5000xxxxx234abff), slot(3)
Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: enclosure level(0x0001), connector name( )
Mar 16 22:25:05 esx04-svm kernel: sd 6:0:3:0: task abort: SUCCESS scmd(ffffxxxxx9e9a140)
Mar 16 22:25:07 esx04-svm systemd-udevd: worker [2668] /devices/pci0000:00/0000:00:18.1/0000:1c:00.0/host6/port-6:0/expander-6:0/port-6:0:3/end_device-6:0:3/target6:0:3/6:0:3:0/block/sde is taking a long time
The system enters the DEGRADED state, and a rebuild is triggered:
2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A.
2021-03-16 22:24:06.328000:0003259:CLI_COMMAND_SUCCEEDED INFO Command clear_sds_device_error succeeded. [60856484]
:
2021-03-16 22:24:16.856000:0003262:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled.
2021-03-16 22:24:23.862000:0003263:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled.
2021-03-16 22:24:23.863000:0003264:SDS_IN_COOL_DOWN WARNING SDS: SDS_x.x.x.x (ID yyyyyyyy00000003) will disconnect from MDM for 15 seconds. failed to reconnect multiple times
2021-03-16 22:24:24.108000:0003265:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
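To correlate the Clear Device Error command with its effect on the system, the MDM events can be parsed into a simple timeline. This is a rough Python sketch under stated assumptions: the regular expression reflects only the event layout visible in the excerpt above (timestamp, sequence number, event name, severity, text), and parse_event and seconds_between are hypothetical helper names, not part of any PowerFlex interface.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt above:
#   <timestamp>:<sequence>:<EVENT_NAME> <SEVERITY> <text>
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<seq>\d+):(?P<name>\w+)\s+(?P<severity>\w+)\s+(?P<text>.*)$"
)


def parse_event(line):
    """Return a dict for one MDM event line, or None if it does not match."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["ts"] = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return event


def seconds_between(lines, first_name, second_name):
    """Seconds from the first `first_name` event to the next `second_name`.

    Returns None if either event is missing from the input.
    """
    events = [e for e in map(parse_event, lines) if e is not None]
    first = next((e for e in events if e["name"] == first_name), None)
    if first is None:
        return None
    second = next(
        (e for e in events if e["name"] == second_name and e["ts"] >= first["ts"]),
        None,
    )
    if second is None:
        return None
    return (second["ts"] - first["ts"]).total_seconds()
```

For the events above, seconds_between(lines, "MDM_CLI_CONF_COMMAND_RECEIVED", "MDM_DATA_DEGRADED") gives about 17.78 seconds, so the system was reported as DEGRADED within roughly 18 seconds of the command being received.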
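Cause
The defective disk is unstable and responds slowly. As a result, the SDS cannot add it back to the system in time and panics to prevent the process from hanging.
This symptom may be observed under certain defective-disk conditions. PowerFlex and the SDS are working as designed.
Resolution
Work with the hardware support team to replace the defective disk and hardware to resolve this issue.
Additional Information
For the general Clear Device Error procedure, see the Clear Device Error procedure.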
Affected Products
PowerFlex Software
Article Properties
Article Number: 000185002
Article Type: Solution
Last Modified: 07 Nov 2025
Version: 6