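PowerFlex: SDS Process Instability Causes I/O Errors
Summary: An SDS repeatedly becomes unresponsive, which causes I/O errors because the SDS is not removed from the system.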
Symptoms
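In the MDM events, repeated SDS disconnects (the SDS is repeatedly decoupled) may be observed, and applications and SDCs may report I/O errors. SDS instability as observed in the MDM events: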
# grep ee9b4eb200000002 events.txt | egrep -v "(OSC|SDC_CON|SDC_DISC)"
4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED INFO SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED INFO SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
4284674 2020-10-26 23:40:12.318 SDS_RECONNECTED INFO SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
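The grep above can be complemented by a small script that counts reconnect events per SDS. The following is a minimal sketch, assuming the event line layout shown above; the function name `find_flapping_sds`, the 10-minute window, and the threshold of 3 are illustrative assumptions, not PowerFlex defaults.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches MDM event lines in the layout shown above, for example:
# 4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED INFO SDS: <name> (ID <16 hex digits>) reconnected
EVENT_RE = re.compile(
    r"^\s*\d+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+SDS_RECONNECTED\s+\S+\s+"
    r"SDS:\s+\S+\s+\(ID ([0-9a-f]{16})\)"
)


def find_flapping_sds(lines, window=timedelta(minutes=10), threshold=3):
    """Return {sds_id: highest reconnect count seen in any window}.

    Only SDS IDs whose count reaches `threshold` are returned.
    `window` and `threshold` are hypothetical tuning values.
    """
    times = defaultdict(list)
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            times[m.group(2)].append(ts)

    flapping = {}
    for sds_id, stamps in times.items():
        stamps.sort()
        start = 0
        best = 0
        # Sliding window over the sorted timestamps.
        for end, ts in enumerate(stamps):
            while ts - stamps[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= threshold:
            flapping[sds_id] = best
    return flapping
```

To use it, pass the lines of events.txt, for example `find_flapping_sds(open("events.txt"))`. With the three events shown above, the SDS with ID ee9b4eb200000002 is reported with a count of 3.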
The SDC disconnects from the SDS, for example as seen on ESXi:
vmkernel.0:2020-10-27T04:01:01.193Z cpu56:66319)WARNING: [14896504445] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:02.296Z cpu32:66320)WARNING: [14896505547] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:18.232Z cpu35:66319)WARNING: [14896521482] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:19.332Z cpu35:66319)WARNING: [14896522582] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:34.769Z cpu53:66320)WARNING: [14896538017] Disconnected from SDS with ID ee9b4eb200000002
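To see how long each disconnect lasts from the SDC side, the disconnect and connect messages can be paired. This is a rough sketch, assuming the vmkernel message layout shown above; the helper name `sds_outages` is hypothetical.

```python
import re
from datetime import datetime

# Matches the timestamp and the connect/disconnect message for an SDS ID,
# in the vmkernel layout shown above.
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z.*?"
    r"(Disconnected from|Connected to) SDS with ID ([0-9a-f]{16})"
)


def sds_outages(lines, sds_id):
    """Return [(disconnect_time, seconds_until_reconnect)] for one SDS ID.

    A disconnect that has no later reconnect in the input is reported
    with None as its duration.
    """
    outages = []
    down_since = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m.group(3) != sds_id:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if m.group(2) == "Disconnected from":
            down_since = ts
        elif down_since is not None:
            outages.append((down_since, (ts - down_since).total_seconds()))
            down_since = None
    if down_since is not None:
        outages.append((down_since, None))
    return outages
```

For the five vmkernel lines above, this yields two short outages of about 1.1 seconds each, followed by a disconnect that is still open at the end of the excerpt.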
I/O errors appear on the SDC:
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:491 :[14895126141] IO-ERROR Type TEST_AND_SET. comb: 55880098015. offsetInComb 2721096. SizeInLB 1. SDS_ID 0. Comb Gen 4619. Head Gen 4b30. StartLB ad48.
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:512 :Vol ID 0x735105ff0000001c. Last vol network error status NOT_CONN(4) Reason (ABORTED) RC (ABORTED) Retry count (5) chan (0)
. . .
2020-10-27T04:08:20.234Z cpu35:66313)WARNING: ScaleIO netCon_IsKaNeeded:3761 :CON 0x439dc29f6700 didn't receive message for 30 iterations. Marking as down
2020-10-27T04:08:20.234Z cpu18:66894)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f5efc socket 0x439dc29f6418
2020-10-27T04:08:20.234Z cpu33:66806)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f817c socket 0x439dc29f8698
2020-10-27T04:08:20.234Z cpu0:66879)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed to receive 128 data PTR 0x439dc29f6a7c socket 0x439dc29f6f98
2020-10-27T04:08:20.234Z cpu23:66319)WARNING: [14896943442] Disconnected from SDS with ID ee9b4eb200000002
2020-10-27T04:08:23.246Z cpu37:65868)Res6: 2346: All helpers quiesced (12 cancelled) for vol 'SD4W21AVxFlexCU03': 1280 LFBCs, 20/1 buckets allocated (4 KB), 1 flush, 0 helpers
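If an SDS repeatedly disconnects and reconnects, the issue described in this knowledge base article may occur. In the example below, an NVDIMM hardware (HW) issue caused a SIGBUS error (a memory access error), which crashed the SDS with signal 7.
exp.0: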
26/10 23:37:55.305617 Termination due to signal 7. PID 2601 Faulting address 0x7efb85004000. errno 0
26/10 23:37:55.306321 Writing backtraces for all UMTs:
26/10 23:38:10.132585 Termination due to signal 7. PID 99889 Faulting address 0x7f5485004000. errno 0
26/10 23:38:10.133167 Writing backtraces for all UMTs:
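The time between consecutive crashes shows how quickly the restarted SDS process fails again. The sketch below extracts the termination records from exp.0 and computes those intervals. It assumes the line layout shown above; because exp.0 timestamps carry no year, the `year` parameter is an assumption supplied by the caller, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches termination records in the exp.0 layout shown above
# (day/month, time with microseconds, signal number, PID).
TERM_RE = re.compile(
    r"(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) Termination due to signal (\d+)\. PID (\d+)"
)


def crash_events(lines, year=2020):
    """Return [(timestamp, signal, pid)] for each termination record.

    `year` is an assumption: exp.0 lines do not include it.
    """
    events = []
    for line in lines:
        m = TERM_RE.search(line)
        if m:
            ts = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%d/%m %H:%M:%S.%f")
            events.append((ts, int(m.group(2)), int(m.group(3))))
    return events


def crash_intervals(events):
    """Return the seconds elapsed between consecutive crashes."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]
```

For the excerpt above, the two signal 7 terminations (PIDs 2601 and 99889) are about 14.8 seconds apart.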
Messages:
Oct 26 23:37:55 kernel: mce: Uncorrected hardware memory error in user-access at 3d84e04440
Oct 26 23:37:55 kernel: MCE 0x3d84e04: Killing sds-3.0.1000.20:2601 due to hardware memory corruption
Oct 26 23:37:55 kernel: MCE 0x3d84e04: dax page page recovery: Recovered
Oct 26 23:37:55 kernel: sds-3.0.1000.20:4006 conflicting memory types 3d84e04000-3d84e05000 uncached-minus<->write-back
Oct 26 23:37:55 kernel: reserve_memtype failed [mem 0x3d84e04000-0x3d84e04fff], track uncached-minus, req uncached-minus
Oct 26 23:37:55 kernel: Could not invalidate pfn=0x3d84e04 from 1:1 map
Oct 26 23:37:56 sh: abrt-dump-oops: Found oopses: 1
Oct 26 23:37:56 sh: abrt-dump-oops: Creating problem directories
Oct 26 23:37:56 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Oct 26 23:37:56 systemd: Configuration file /opt/nsr/admin/networker.service is marked executable.
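To confirm that a hardware memory error is behind the crashes, the messages log can be searched for kernel MCE records that kill an SDS process. This is a minimal sketch, assuming the kernel message wording shown above; the function name `sds_mce_kills` and the dictionary keys are hypothetical.

```python
import re

# Matches kernel records such as:
# kernel: MCE 0x3d84e04: Killing sds-3.0.1000.20:2601 due to hardware memory corruption
MCE_KILL_RE = re.compile(
    r"kernel: MCE (0x[0-9a-f]+): Killing (sds-[\w.]+):(\d+) due to hardware memory corruption"
)


def sds_mce_kills(lines):
    """Return one dict per SDS process killed by a machine check event."""
    hits = []
    for line in lines:
        m = MCE_KILL_RE.search(line)
        if m:
            hits.append({
                "pfn": m.group(1),      # page frame number reported by the kernel
                "process": m.group(2),  # process name, including the SDS version
                "pid": int(m.group(3)),
            })
    return hits
```

In the excerpt above, the killed PID (2601) matches the PID in the first termination record in exp.0, which ties the SDS crash to the memory error.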
Cause
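1. A software (SW) or hardware failure causes the SDS process to become unresponsive and disconnect from the MDM.
2. The SDS recovers from the crash and passes the "reconfiguration phase". From the MDM's perspective, this SDS is now officially up for all other system components, including the SDCs.
3. After 15 seconds (the default), the SDC retries the I/O, but by then the SDS has become unresponsive again, as described in step 1.
4. The I/O fails on timeout, and the application on the SDC reports an I/O error.
5. Steps 2 through 4 may repeat until this SDS is removed from the system.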
Resolution
The system is working as designed.
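Option 1:
Remove the SDS from the cluster. An SDS can be removed at any time without downtime. During removal, the associated data is copied to other nodes. Removal is asynchronous and may take a long time.
Note: If volumes use capacity on this SDS and that capacity cannot be relocated because of insufficient free space, the removal fails.
Fix the hardware or software issue that made the SDS unstable, then return the SDS to the cluster.
Option 2:
Monitor the system. If the SDS starts flapping under similar circumstances, stop the SDS service by running the following command on the SDS:
/opt/emc/scaleio/sds/bin/delete_service.sh
Note: Stopping the SDS service triggers a rebuild. After the issue is resolved, restart the SDS service by running the following command on the SDS:
/opt/emc/scaleio/sds/bin/create_service.sh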
Additional Information
Resiliency against this type of event is planned for PowerFlex software version 4.0.
Affected Products
PowerFlex rack, VxRack
Article Properties
Article Number: 000181511
Article Type: Solution
Last Modified: 07 May 2026
Version: 3