VxRail:單一的 NVMe 磁碟故障會導致整個 VSAN 叢集發生故障並發生 IO 錯誤

Summary: 單一 NVMe 磁碟故障會導致整個 vSAN 叢集發生故障並出現 IO 錯誤。

This article applies to This article does not apply to This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

  • VxRail 7.x 程式碼版本
  • NVMe 磁碟故障
  • PDL 事件會在hostd.log上報告
2024-05-23T04:49:18.562+0100 info hostd[61598519] [⋮ sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
-->    key = 135,
-->    chainId = -1,
-->    createdTime = "1970-01-01T00:00:00Z",
-->    userName = "",
-->    host = (vim.event.HostEventArgument) {
-->       name = "host.domain.com",
-->       host = 'vim.HostSystem:ha-host'
-->    },
-->    eventTypeId = "esx.problem.vob.vsan.pdl.offline",
-->    arguments = (vmodl.KeyAnyValue) [
-->       (vmodl.KeyAnyValue) {
-->          key = "1",
-->          value = "52071875-618f-3f4b-27f5-89ab5d2a9bf6"
-->       }
-->    ],
-->    objectId = "ha-host",
-->    objectType = "vim.HostSystem",
--> }
 
  • vSAN 管理記錄報告「停滯的描述項」訊息
2024-05-23T04:49:09.355+0100 cpu99:2100019)DOM: DOM2PCPrintDescriptor:2121: [1287682095:0x45dabbe1f140] => Stuck descriptor
2024-05-23T04:49:10.942+0100 cpu122:2100017)DOM: DOM2PCPrintDescriptor:2121: [11772501:0x45dabbf65d40] => Stuck descriptor
2024-05-23T06:02:49.344+0100 cpu73:2100015)DOM: DOM2PCPrintDescriptor:2121: [30274285827:0x45dabbf2a840] => Stuck descriptor
 
  • VMkernel 報告停滯的 IO
2024-05-23T04:49:09.379+0100 cpu43:2099914)DOM: DOM2PCPrintDescriptor:2121: [14235787:0x45bac61c17c0] => Stuck descriptor
2024-05-23T04:50:21.899+0100 cpu20:67978583)ScsiDeviceIO: 12480: Task mgmt request issued to device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________0XXXXXXXXXEE38C is stuck (WorldID 0,
md 0x28, CmdSN 7daf4). Issuing yellow notification to the
2024-05-23T04:50:21.899+0100 cpu20:67978583)ScsiDeviceIO: 12559: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________016CD616E28EE38C
2024-05-23T04:51:33.949+0100 cpu64:67978582)ScsiDeviceIO: 12527: Task mgmt request issued to device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________0XXXXXXXXXEE38C is stuck (WorldID 0,
md 0x28, CmdSN 7daf4). Issuing red notification to the
2024-05-23T04:51:33.949+0100 cpu64:67978582)ScsiDeviceIO: 12559: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________0XXXXXXXXXEE38C

Cause

  • 如果磁碟上有待處理的 IO,靜止作業不會完成。
  • 如果 IO 停滯,則驅動程式有時無法完成 IO 時,有時絕不會預期完成 IO。
  • 在裝置層卡住 IO,且如果靜止作業是由於暫時性錯誤處理或 APD 處理或任何 DECOM 作業而啟動,則靜止永遠不會完成,因為清理會由於擱置中的 IO 而無法繼續進行。
  • 這會導致競爭狀況。

 

Resolution

問題已在 ESXi 7.0U3 P09 或最新版本中解決,亦適用於 VxRail 程式碼版本 7.0.520。

主機受到影響的因應措施:

  1. 取消註冊節點上的虛擬機器
  2. 讓主機處於維護模式
  3. 將主機重新開機。
  4. 如果磁碟回報硬體故障,請更換 NVMe 磁碟。
  5. 請按如下方式評估升級,以取得永久修正。

Affected Products

VxRail, VxRail Appliance Series, VxRail Software
Article Properties
Article Number: 000225946
Article Type: Solution
Last Modified: 11 Apr 2025
Version:  8
Find answers to your questions from other Dell users
Support Services
Check if your device is covered by Support Services.