PowerFlex SDS process instability causes I/O errors

Summary: An SDS repeatedly becomes unresponsive, causing I/O errors, because the SDS is not removed from the system.

Symptoms

In the MDM events, repeated SDS disconnections (repeated decoupling) may be observed, and applications and SDCs may report I/O errors. SDS instability as seen in the MDM events:

# grep ee9b4eb200000002 events.txt  | egrep -v "(OSC|SDC_CON|SDC_DISC)"
4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED           INFO     	 SDS: sds-********v004 (ID ee9b4eb200000002) reconnected 
4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED           INFO     	 SDS: sds-********v004 (ID ee9b4eb200000002) reconnected

4284674 2020-10-26 23:40:12.318 SDS_RECONNECTED           INFO     	 SDS: sds-********v004 (ID ee9b4eb200000002) reconnected
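
To put a number on the instability, the following minimal sketch counts `SDS_RECONNECTED` events per SDS ID in an exported events file. It assumes the column layout of the sample lines above (sequence number, date, time, event name, severity, message). The function name `count_reconnects` and the regular expression are illustrative and are not part of any PowerFlex tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample lines above:
# <seq> <date> <time> SDS_RECONNECTED <severity> SDS: <name> (ID <hex id>) reconnected
EVENT_RE = re.compile(
    r"^\d+\s+\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d+\s+"
    r"SDS_RECONNECTED\s+\w+\s+SDS:\s+\S+\s+\(ID\s+([0-9a-f]+)\)"
)

def count_reconnects(lines):
    """Return a Counter mapping SDS ID to its number of SDS_RECONNECTED events."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the MDM events excerpt above.
sample_events = [
    "4284507 2020-10-26 23:38:02.330 SDS_RECONNECTED   INFO   SDS: sds-********v004 (ID ee9b4eb200000002) reconnected",
    "4284546 2020-10-26 23:38:17.103 SDS_RECONNECTED   INFO   SDS: sds-********v004 (ID ee9b4eb200000002) reconnected",
    "4284674 2020-10-26 23:40:12.318 SDS_RECONNECTED   INFO   SDS: sds-********v004 (ID ee9b4eb200000002) reconnected",
]

for sds_id, count in count_reconnects(sample_events).items():
    print(f"{sds_id}: {count} reconnects")
```

An SDS that reconnects several times within a few minutes, as here, is a candidate for the flapping condition described in this article.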

The SDC disconnects from the SDS, for example on an ESXi host:

vmkernel.0:2020-10-27T04:01:01.193Z cpu56:66319)WARNING: [14896504445] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:02.296Z cpu32:66320)WARNING: [14896505547] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:18.232Z cpu35:66319)WARNING: [14896521482] Disconnected from SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:19.332Z cpu35:66319)WARNING: [14896522582] Connected to SDS with ID ee9b4eb200000002
vmkernel.0:2020-10-27T04:01:34.769Z cpu53:66320)WARNING: [14896538017] Disconnected from SDS with ID ee9b4eb200000002
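
As a rough aid for reading logs like the one above, this sketch measures how long each SDC connection to a given SDS stayed up before it dropped again. It assumes the vmkernel line format shown in the excerpt. The function name `connection_lifetimes` is hypothetical.

```python
import re
from datetime import datetime

# Assumed format, based on the vmkernel lines above:
# ...<ISO timestamp>Z cpuN:NNN)WARNING: [..] Disconnected from|Connected to SDS with ID <hex id>
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*?"
    r"(Disconnected from|Connected to) SDS with ID ([0-9a-f]+)"
)

def connection_lifetimes(lines, sds_id):
    """Return the seconds each connection to sds_id lasted before disconnecting."""
    lifetimes = []
    connected_at = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match or match.group(3) != sds_id:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if match.group(2) == "Connected to":
            connected_at = stamp
        elif connected_at is not None:
            # A disconnect that follows a known connect closes one lifetime.
            lifetimes.append((stamp - connected_at).total_seconds())
            connected_at = None
    return lifetimes

# Sample lines copied from the ESXi excerpt above.
sample_log = [
    "vmkernel.0:2020-10-27T04:01:01.193Z cpu56:66319)WARNING: [14896504445] Disconnected from SDS with ID ee9b4eb200000002",
    "vmkernel.0:2020-10-27T04:01:02.296Z cpu32:66320)WARNING: [14896505547] Connected to SDS with ID ee9b4eb200000002",
    "vmkernel.0:2020-10-27T04:01:18.232Z cpu35:66319)WARNING: [14896521482] Disconnected from SDS with ID ee9b4eb200000002",
    "vmkernel.0:2020-10-27T04:01:19.332Z cpu35:66319)WARNING: [14896522582] Connected to SDS with ID ee9b4eb200000002",
    "vmkernel.0:2020-10-27T04:01:34.769Z cpu53:66320)WARNING: [14896538017] Disconnected from SDS with ID ee9b4eb200000002",
]

print(connection_lifetimes(sample_log, "ee9b4eb200000002"))
```

For the sample, both connections last roughly 15 to 16 seconds. Short, regular lifetimes like these point to a repeating crash cycle rather than a one-off network drop.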

I/O errors appear on the SDC:

2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:491 :[14895126141] IO-ERROR Type TEST_AND_SET. comb: 55880098015. offsetInComb 2721096. SizeInLB 1. SDS_ID 0. Comb Gen 4619. Head Gen 4b30. StartLB ad48.
2020-10-27T03:38:02.752Z cpu32:66313)WARNING: ScaleIO mapVolIO_ReportIOErrorIfNeeded:512 :Vol ID 0x735105ff0000001c. Last vol network error status NOT_CONN(4) Reason (ABORTED) RC (ABORTED) Retry count (5) chan (0)
.
.
.
2020-10-27T04:08:20.234Z cpu35:66313)WARNING: ScaleIO netCon_IsKaNeeded:3761 :CON 0x439dc29f6700 didn't receive message for 30 iterations.  Marking as down
2020-10-27T04:08:20.234Z cpu18:66894)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439dc29f5efc socket 0x439dc29f6418
2020-10-27T04:08:20.234Z cpu33:66806)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439dc29f817c socket 0x439dc29f8698
2020-10-27T04:08:20.234Z cpu0:66879)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439dc29f6a7c socket 0x439dc29f6f98
2020-10-27T04:08:20.234Z cpu23:66319)WARNING: [14896943442] Disconnected from SDS with ID ee9b4eb200000002
2020-10-27T04:08:23.246Z cpu37:65868)Res6: 2346: All helpers quiesced (12 cancelled)  for vol 'SD4W21AVxFlexCU03': 1280 LFBCs, 20/1 buckets allocated (4 KB), 1 flush, 0 helpers

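If SDS disconnections and reconnections occur repeatedly, the issue described in this KB may be occurring. In the following example, an NVDIMM hardware (HW) problem causes a SIGBUS error (bad memory access), which crashes the SDS with signal 7. exp.0: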

26/10 23:37:55.305617 Termination due to signal 7. PID 2601 Faulting address 0x7efb85004000. errno 0
26/10 23:37:55.306321 Writing backtraces for all UMTs:
26/10 23:38:10.132585 Termination due to signal 7. PID 99889 Faulting address 0x7f5485004000. errno 0
26/10 23:38:10.133167 Writing backtraces for all UMTs:
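
The termination lines above can be pulled out of the log mechanically. The sketch below extracts the signal number, PID, and faulting address, and translates the signal number using the standard Linux numbering on x86 and ARM (7 is SIGBUS). The function name `parse_terminations` and the regular expression are assumptions based only on the two sample lines.

```python
import re

# Signal numbers as defined on Linux x86/ARM; other platforms number them differently.
LINUX_SIGNALS = {6: "SIGABRT", 7: "SIGBUS", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

# Assumed format, based on the sample lines above.
TERM_RE = re.compile(
    r"Termination due to signal (\d+)\. PID (\d+) Faulting address (0x[0-9a-f]+)"
)

def parse_terminations(lines):
    """Return one dict per 'Termination due to signal' line."""
    crashes = []
    for line in lines:
        match = TERM_RE.search(line)
        if match:
            signo = int(match.group(1))
            crashes.append({
                "signal": LINUX_SIGNALS.get(signo, f"signal {signo}"),
                "pid": int(match.group(2)),
                "address": match.group(3),
            })
    return crashes

# Sample lines copied from the excerpt above.
sample_exp = [
    "26/10 23:37:55.305617 Termination due to signal 7. PID 2601 Faulting address 0x7efb85004000. errno 0",
    "26/10 23:37:55.306321 Writing backtraces for all UMTs:",
    "26/10 23:38:10.132585 Termination due to signal 7. PID 99889 Faulting address 0x7f5485004000. errno 0",
]

for crash in parse_terminations(sample_exp):
    print(crash)
```

Two crashes with different PIDs a few seconds apart show that the SDS process was restarted and then failed again in the same way.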

messages:

Oct 26 23:37:55  kernel: mce: Uncorrected hardware memory error in user-access at 3d84e04440
Oct 26 23:37:55  kernel: MCE 0x3d84e04: Killing sds-3.0.1000.20:2601 due to hardware memory corruption
Oct 26 23:37:55  kernel: MCE 0x3d84e04: dax page page recovery: Recovered
Oct 26 23:37:55  kernel: sds-3.0.1000.20:4006 conflicting memory types 3d84e04000-3d84e05000 uncached-minus<->write-back
Oct 26 23:37:55  kernel: reserve_memtype failed [mem 0x3d84e04000-0x3d84e04fff], track uncached-minus, req uncached-minus
Oct 26 23:37:55  kernel: Could not invalidate pfn=0x3d84e04 from 1:1 map
Oct 26 23:37:56  sh: abrt-dump-oops: Found oopses: 1
Oct 26 23:37:56  sh: abrt-dump-oops: Creating problem directories
Oct 26 23:37:56  sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Oct 26 23:37:56  systemd: Configuration file /opt/nsr/admin/networker.service is marked executable.
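
To tie the kernel messages back to the SDS crash, this sketch extracts the faulting physical address and the killed process from the MCE lines, then checks that the address lies in the page the kernel reports. It assumes 4 KiB pages, so the page frame number is the physical address shifted right by 12 bits. The function name `summarize_mce` is hypothetical.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, so PFN = physical address >> 12

# Patterns assumed from the sample kernel messages above.
ADDR_RE = re.compile(r"Uncorrected hardware memory error in user-access at ([0-9a-f]+)")
KILL_RE = re.compile(r"MCE (0x[0-9a-f]+): Killing (\S+):(\d+) due to hardware memory corruption")

def summarize_mce(lines):
    """Collect the faulting address and the killed process from kernel MCE lines."""
    summary = {}
    for line in lines:
        addr = ADDR_RE.search(line)
        if addr:
            summary["address"] = int(addr.group(1), 16)
        kill = KILL_RE.search(line)
        if kill:
            summary["pfn"] = int(kill.group(1), 16)
            summary["process"] = kill.group(2)
            summary["pid"] = int(kill.group(3))
    if "address" in summary and "pfn" in summary:
        # True when the faulting address falls inside the reported page.
        summary["same_page"] = (summary["address"] >> PAGE_SHIFT) == summary["pfn"]
    return summary

# Sample lines copied from the excerpt above.
sample_messages = [
    "Oct 26 23:37:55  kernel: mce: Uncorrected hardware memory error in user-access at 3d84e04440",
    "Oct 26 23:37:55  kernel: MCE 0x3d84e04: Killing sds-3.0.1000.20:2601 due to hardware memory corruption",
]

print(summarize_mce(sample_messages))
```

In the sample, the killed PID (2601) is the same PID that the SDS termination line reports at 23:37:55, which links the hardware memory error to the SDS crash.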

Cause

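  1. A software (SW) or hardware failure causes the SDS process to become unresponsive and disconnect from the MDM.
  2. The SDS recovers from the crash and goes through the "reconfiguration phase". From the MDM's point of view, the SDS is then officially available to all other system components, including the SDCs.
  3. After 15 seconds (the default), the SDC retries the I/O, but by then the SDS is unresponsive again, as described in point 1.
  4. The I/O fails on timeout, and the application on the SDC reports an I/O error.
  5. Steps 2 through 4 can repeat again and again until the SDS is removed from the system.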

Resolution

The system is working as designed.

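Option 1:
Remove the SDS from the cluster. An SDS can be removed at any time without downtime. During removal, the data it holds is copied to other nodes. Removal is asynchronous and may take a long time.

Note: If volumes use capacity on this SDS and that capacity cannot be relocated because of insufficient free space, the removal fails.

Fix the hardware or software problem that made the SDS unstable, then add the SDS back to the cluster.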

Option 2:
Monitor the system. If an SDS starts flapping under similar circumstances, run the following command on the SDS to stop the SDS service:
 /opt/emc/scaleio/sds/bin/delete_service.sh

Note: Stopping the SDS service triggers a rebuild. After the problem is resolved, run the following command on the SDS to restart the SDS service:
 /opt/emc/scaleio/sds/bin/create_service.sh
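
The monitoring step in Option 2 could be approximated with a sliding-window check such as the one sketched here. The thresholds (three reconnects within five minutes) and the function name `is_flapping` are illustrative assumptions, not values documented for PowerFlex. The script only prints a suggestion and does not run any command.

```python
from datetime import datetime, timedelta

# Stop command quoted from this article; this sketch never executes it.
STOP_CMD = "/opt/emc/scaleio/sds/bin/delete_service.sh"

def is_flapping(timestamps, max_events=3, window=timedelta(minutes=5)):
    """Return True if at least max_events reconnects fall within one window."""
    times = sorted(timestamps)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= max_events:
            return True
    return False

# Reconnect times taken from the MDM events excerpt in the Symptoms section.
reconnects = [
    datetime(2020, 10, 26, 23, 38, 2, 330000),
    datetime(2020, 10, 26, 23, 38, 17, 103000),
    datetime(2020, 10, 26, 23, 40, 12, 318000),
]

if is_flapping(reconnects):
    print(f"SDS appears to be flapping; consider running {STOP_CMD} on the SDS")
```

Whether to stop the service remains an operator decision, because stopping the SDS triggers a rebuild.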

Additional Information

Resilience to this type of event is planned for PowerFlex software version 4.0.

Affected Products

PowerFlex rack, VxRack
Article Properties
Article Number: 000181511
Article Type: Solution
Last Modified: 07 May 2026
Version:  3