PowerFlex: SDS Crashes with "Command took too long (TGT_MSG_TYPE__ADD_DEV)" When Attempting Clear Device Error

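Summary: Under certain underlying disk conditions, the Software Defined Storage (SDS) process crashes when a Clear Device Error operation is attempted.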

Symptoms

Hardware errors logged for the disk indicate that it is responding slowly.

Operating system messages:

Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: attempting task abort! scmd(ffffxxxxxa010540)
Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: [sde] CDB: Write(10) 2a 00 24 48 36 60 00 00 08 00
Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: handle(0x000d), sas_address(0x58cexxxxx1139e9a), phy(3)
Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: enclosure logical id(0x50005xxxxx34abff), slot(3)
Mar 14 23:09:49 esx04-svm kernel: scsi target6:0:3: enclosure level(0x0001), connector name(     )
:
Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: task abort: SUCCESS scmd(ffffxxxxxa010540)
Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] CDB: Write(10) 2a 00 1b 6f eb 20 00 00 18 00
Mar 14 23:13:04 esx04-svm kernel: blk_update_request: I/O error, dev sde, sector 460319520
:
Mar 14 23:13:04 esx04-svm systemd: sds.service: main process exited, code=exited, status=254/n/a
Mar 14 23:13:04 esx04-svm systemd: Unit sds.service entered failed state.
Mar 14 23:13:04 esx04-svm systemd: sds.service failed.
Mar 14 23:13:04 esx04-svm systemd: sds.service has no holdoff time, scheduling restart.
Mar 14 23:13:04 esx04-svm systemd: Stopped scaleio sds.
Mar 14 23:13:04 esx04-svm systemd: Started scaleio sds.
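
The kernel messages above follow a recognizable sequence: a task abort on a SCSI address, a FAILED result, and then a block-layer I/O error. As a rough triage aid, a short Python sketch along the following lines could count those events per device. The regular expressions are derived only from the sample lines in this article and are an assumption about the message format on other kernels; the name `summarize_disk_errors` is illustrative and is not part of PowerFlex.

```python
import re
from collections import Counter

# Patterns derived from the sample kernel messages in this article (assumed format).
TASK_ABORT = re.compile(r"sd (\d+:\d+:\d+:\d+): attempting task abort!")
FAILED_RESULT = re.compile(r"sd \d+:\d+:\d+:\d+: \[(\w+)\] FAILED Result: hostbyte=(\w+)")
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def summarize_disk_errors(lines):
    """Count task aborts, failed results, and I/O errors found in kernel log lines."""
    summary = {
        "task_aborts": Counter(),     # keyed by SCSI address, e.g. "6:0:3:0"
        "failed_results": Counter(),  # keyed by (block device, hostbyte)
        "io_errors": Counter(),       # keyed by block device, e.g. "sde"
    }
    for line in lines:
        match = TASK_ABORT.search(line)
        if match:
            summary["task_aborts"][match.group(1)] += 1
            continue
        match = FAILED_RESULT.search(line)
        if match:
            summary["failed_results"][(match.group(1), match.group(2))] += 1
            continue
        match = IO_ERROR.search(line)
        if match:
            summary["io_errors"][match.group(1)] += 1
    return summary


# Sample lines copied from the messages above.
SAMPLE_KERNEL_LOG = [
    "Mar 14 23:09:49 esx04-svm kernel: sd 6:0:3:0: attempting task abort! scmd(ffffxxxxxa010540)",
    "Mar 14 23:13:04 esx04-svm kernel: sd 6:0:3:0: [sde] FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK",
    "Mar 14 23:13:04 esx04-svm kernel: blk_update_request: I/O error, dev sde, sector 460319520",
]

if __name__ == "__main__":
    for kind, counts in summarize_disk_errors(SAMPLE_KERNEL_LOG).items():
        print(kind, dict(counts))
```

Repeated task aborts and I/O errors against the same device are the kind of evidence that is useful to hand to the hardware support team.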

SDS trace log:

14/03 23:10:04.535824 0x7fadbde2bdb8:contDevMngr_HandleLongInflightIoViolation:05927: IO on devId: fb775d1000030002 (/dev/sde) took too long, High threshold exceeded - waited for reaper 45830 millis
:
14/03 23:13:11.395813 0x7f4832c55db8:mosAsyncIO_DoIOTimedSyncIntern:01564: Time budget exceeded during IO to /dev/sde (Give-up on inflight IO)
14/03 23:13:11.900810 0x7f4832c55db8:mosAsyncIO_DoIOTimedSyncIntern:01564: Time budget exceeded during IO to /dev/sde (Give-up on inflight IO)
14/03 23:13:11.900846 0x7f4832c55db8:contDev_SendDeviceError:02801: Sending device error to MDM: DevId:fb775dxxxxx30002 deviceName: /dev/sde readError: FALSE WriteError: TRUE
14/03 23:13:11.901335 0x7f4832c55db8:contDev_SendDeviceError:02801: Sending device error to MDM: DevId:fb775dxxxxx30002 deviceName: /dev/sde readError: TRUE WriteError: TRUE

The SDS marks the device as being in an error state.

Metadata Manager (MDM) events:

2021-03-14 23:13:12.031000:0003180:SDS_DEV_ERROR_REPORT    ERROR    Device error reported on SDS: SDS_x.x.x.x, Device: N/A-0002. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS

Later, the user attempts a Clear Device Error operation to add the device back into the system:

2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED    INFO     Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A.

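When the user attempts the Clear Device Error operation, the SDS process crashes and disconnects from the MDM. As a result, Clear Device Error cannot complete internally, and the disk cannot be added back to the SDS.

The SDS process crashes because it reaches an internal timeout while processing the internal SDS Add Dev command that is issued as part of the Clear Device Error operation.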

SDS exp.0:

16/03 22:25:06.340250 [CHOKE_POINT] Panic in file /data/builds/workspace/ScaleIO-Common-Job@2/src/tgt/cont/cont_cmd.c, line 11459, function contCmd_TimerExpiredInner, PID 29524.Panic Expression ALWAYS_ASSERT Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV) - Restarting.
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosDbg_PanicPrepare+0x135) [0x8dd885]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(contCmd_TimerExpiredInner+0x6e) [0x80c52e]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208() [0x73e16b]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimerQ_PollUnlocked+0xb2) [0x81f372]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimer_PollQRange+0xf8) [0x821088]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosTimer_Loop+0x28) [0x8210e8]
/opt/emc/scaleio/sds/bin/sds-3.0.1000.208(mosOsThrd_StartFunc+0x118) [0x67e798]
/lib64/libpthread.so.0(+0x7dd5) [0x7f17301f8dd5]
16/03 22:25:07.339909 Termination due to SIGALRM. PID 29524

SDS trace log:

16/03 22:24:06.338880 0x7f16e6cefdb8:contDev_Add:02034: Add Dev, path found in /dev/sde
16/03 22:24:16.344801 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxddb8, context=2277
16/03 22:24:22.344795 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxx3db8, context=2277
16/03 22:25:06.339793 (nil):contCmd_TimerExpiredInner:11454: Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxdb8, context=2277
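
In the trace above, the first "Long operation detected" warning appears about 10 seconds after `contDev_Add`, and the "Command took too long" line appears about 60 seconds after it. To measure this interval in other traces, a minimal Python sketch could look like the following. It assumes that each trace line starts with a `DD/MM HH:MM:SS.ffffff` timestamp as shown here. The trace omits the year, so the `year` parameter is an assumption supplied by the caller. The function names are illustrative only.

```python
from datetime import datetime


def parse_trace_timestamp(line, year=2021):
    """Parse the leading 'DD/MM HH:MM:SS.ffffff' stamp of an SDS trace line.

    The trace does not record the year, so the caller supplies it (assumption).
    """
    stamp = line[:21]
    return datetime.strptime(f"{year}/{stamp}", "%Y/%d/%m %H:%M:%S.%f")


def add_dev_elapsed_seconds(lines, year=2021):
    """Return seconds from the first contDev_Add line to the timeout line, or None."""
    started = expired = None
    for line in lines:
        if started is None and ":contDev_Add:" in line:
            started = parse_trace_timestamp(line, year)
        elif ":contCmd_TimerExpiredInner:" in line and "Command took too long" in line:
            expired = parse_trace_timestamp(line, year)
    if started is None or expired is None:
        return None
    return (expired - started).total_seconds()


# Sample lines copied from the trace above.
SAMPLE_TRACE = [
    "16/03 22:24:06.338880 0x7f16e6cefdb8:contDev_Add:02034: Add Dev, path found in /dev/sde",
    "16/03 22:24:16.344801 (nil):contCmd_SmallTimerExpired:11491: Long operation detected - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxddb8, context=2277",
    "16/03 22:25:06.339793 (nil):contCmd_TimerExpiredInner:11454: Command took too long - 2277 (TGT_MSG_TYPE__ADD_DEV), Umt 0x7f1xxxxxdb8, context=2277",
]

if __name__ == "__main__":
    print(add_dev_elapsed_seconds(SAMPLE_TRACE))
```

The interval is measured from this one trace and should not be read as a documented timeout value.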

Operating system messages:

Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: handle(0x000d), sas_address(0x58ce38ee21139e9a), phy(3)
Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: enclosure logical id(0x5000xxxxx234abff), slot(3)
Mar 16 22:25:05 esx04-svm kernel: scsi target6:0:3: enclosure level(0x0001), connector name(     )
Mar 16 22:25:05 esx04-svm kernel: sd 6:0:3:0: task abort: SUCCESS scmd(ffffxxxxx9e9a140)
Mar 16 22:25:07 esx04-svm systemd-udevd: worker [2668] /devices/pci0000:00/0000:00:18.1/0000:1c:00.0/host6/port-6:0/expander-6:0/port-6:0:3/end_device-6:0:3/target6:0:3/6:0:3:0/block/sde is taking a long time

 

The system enters the DEGRADED state, and a rebuild is triggered:

2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A.
2021-03-16 22:24:06.328000:0003259:CLI_COMMAND_SUCCEEDED INFO Command clear_sds_device_error succeeded. [60856484]
:
2021-03-16 22:24:16.856000:0003262:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled.
2021-03-16 22:24:23.862000:0003263:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled.
2021-03-16 22:24:23.863000:0003264:SDS_IN_COOL_DOWN WARNING SDS: SDS_x.x.x.x (ID yyyyyyyy00000003) will disconnect from MDM for 15 seconds. failed to reconnect multiple times
2021-03-16 22:24:24.108000:0003265:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
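
To see how quickly the SDS decoupled after the command was received, the MDM events above can be lined up against the `clear_sds_device_error` entry. The following Python sketch is one hedged way to do that. The event layout (`timestamp:id:NAME SEVERITY message`) is inferred from the lines in this article, and the 120-second window and the function names are arbitrary, illustrative choices.

```python
import re
from datetime import datetime

# Layout inferred from the MDM event lines in this article (assumption).
EVENT = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):(\d+):(\w+)\s+(\w+)\s+(.*)$"
)


def parse_mdm_event(line):
    """Split one MDM event line into its fields, or return None if it does not match."""
    match = EVENT.match(line.strip())
    if not match:
        return None
    when, event_id, name, severity, message = match.groups()
    return {
        "time": datetime.strptime(when, "%Y-%m-%d %H:%M:%S.%f"),
        "id": int(event_id),
        "name": name,
        "severity": severity,
        "message": message,
    }


def problems_after_clear(lines, window_seconds=120):
    """List (event name, seconds after the command) for ERROR/WARNING events
    that follow the first clear_sds_device_error command within the window."""
    events = [event for event in map(parse_mdm_event, lines) if event]
    start = next(
        (e["time"] for e in events
         if e["name"] == "MDM_CLI_CONF_COMMAND_RECEIVED"
         and "clear_sds_device_error" in e["message"]),
        None,
    )
    if start is None:
        return []
    found = []
    for event in events:
        delay = (event["time"] - start).total_seconds()
        if event["severity"] in ("ERROR", "WARNING") and 0 <= delay <= window_seconds:
            found.append((event["name"], delay))
    return found


# Sample lines copied from the events above.
SAMPLE_EVENTS = [
    "2021-03-16 22:24:06.328000:0003258:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: 'admin'. [60856484] SDS Device: ID:fb775dxxxxx30002@N/A.",
    "2021-03-16 22:24:16.856000:0003262:SDS_DECOUPLED ERROR SDS: SDS_x.x.x.x (id: yyyyyyyy00000003) decoupled.",
    "2021-03-16 22:24:24.108000:0003265:MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.",
]

if __name__ == "__main__":
    for name, delay in problems_after_clear(SAMPLE_EVENTS):
        print(name, delay)
```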

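Cause

The defective disk is unstable and slow to respond. As a result, the SDS cannot add it back to the system in time and deliberately panics to prevent the process from hanging.

This symptom may be observed under certain defective-disk conditions. PowerFlex and the SDS are working as designed.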

Resolution

Work with the hardware support team to replace the defective disk and any related faulty hardware to resolve this issue.

Additional Information

For the general procedure, refer to the Clear Device Error procedure.

Affected Products

PowerFlex Software

Article Properties

Article Number: 000185002
Article Type: Solution
Last Modified: 07 Nov 2025
Version: 6