VxRail: A Single Failed NVMe Disk Causes the Entire vSAN Cluster to Fail with I/O Errors

Summary: A single failed NVMe disk causes the entire vSAN cluster to fail and report I/O errors.


Symptoms

  • VxRail 7.x code versions
  • An NVMe disk has failed
  • A PDL (permanent device loss) event is reported in hostd.log
2024-05-23T04:49:18.562+0100 info hostd[61598519] [⋮ sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
-->    key = 135,
-->    chainId = -1,
-->    createdTime = "1970-01-01T00:00:00Z",
-->    userName = "",
-->    host = (vim.event.HostEventArgument) {
-->       name = "host.domain.com",
-->       host = 'vim.HostSystem:ha-host'
-->    },
-->    eventTypeId = "esx.problem.vob.vsan.pdl.offline",
-->    arguments = (vmodl.KeyAnyValue) [
-->       (vmodl.KeyAnyValue) {
-->          key = "1",
-->          value = "52071875-618f-3f4b-27f5-89ab5d2a9bf6"
-->       }
-->    ],
-->    objectId = "ha-host",
-->    objectType = "vim.HostSystem",
--> }
 
  • The vSAN management logs report "Stuck descriptor" messages
2024-05-23T04:49:09.355+0100 cpu99:2100019)DOM: DOM2PCPrintDescriptor:2121: [1287682095:0x45dabbe1f140] => Stuck descriptor
2024-05-23T04:49:10.942+0100 cpu122:2100017)DOM: DOM2PCPrintDescriptor:2121: [11772501:0x45dabbf65d40] => Stuck descriptor
2024-05-23T06:02:49.344+0100 cpu73:2100015)DOM: DOM2PCPrintDescriptor:2121: [30274285827:0x45dabbf2a840] => Stuck descriptor
 
  • The VMkernel log reports stuck I/O
2024-05-23T04:49:09.379+0100 cpu43:2099914)DOM: DOM2PCPrintDescriptor:2121: [14235787:0x45bac61c17c0] => Stuck descriptor
2024-05-23T04:50:21.899+0100 cpu20:67978583)ScsiDeviceIO: 12480: Task mgmt request issued to device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________0XXXXXXXXXEE38C is stuck (WorldID 0,
md 0x28, CmdSN 7daf4). Issuing yellow notification to the
2024-05-23T04:50:21.899+0100 cpu20:67978583)ScsiDeviceIO: 12559: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________016CD616E28EE38C
2024-05-23T04:51:33.949+0100 cpu64:67978582)ScsiDeviceIO: 12527: Task mgmt request issued to device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________0XXXXXXXXXEE38C is stuck (WorldID 0,
md 0x28, CmdSN 7daf4). Issuing red notification to the
2024-05-23T04:51:33.949+0100 cpu64:67978582)ScsiDeviceIO: 12559: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Ent_NVMe_CM6_RI_7.68TB_____________0XXXXXXXXXEE38C
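
The three symptoms above can be checked for in a saved log file with a short script. The following is a minimal sketch in Python, not a Dell or VMware tool: the function name `scan_lines` and the signature labels are hypothetical, and the regular expressions are derived only from the excerpts shown in this article, so other ESXi builds may word these messages differently.

```python
import re
from collections import Counter

# Signatures copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "pdl_offline": re.compile(r"esx\.problem\.vob\.vsan\.pdl\.offline"),
    "stuck_descriptor": re.compile(r"DOM2PCPrintDescriptor:\d+: .*=> Stuck descriptor"),
    "stuck_io_event": re.compile(r"FDS_DEV_EVENT_REPORT_STUCK_IO event for device (\S+)"),
    "task_mgmt_stuck": re.compile(r"Task mgmt request issued to device (\S+) is stuck"),
}


def scan_lines(lines):
    """Count symptom signatures and collect the device names they mention."""
    counts = Counter()
    devices = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                # Only the two ScsiDeviceIO patterns capture a device name.
                if match.groups():
                    devices.add(match.group(1))
    return counts, devices
```

To use it, pass an open file object for a copy of hostd.log or vmkernel.log to `scan_lines`. Non-zero counts for all of the signatures on the same host, around the same timestamps, would be consistent with the symptoms described here; they do not confirm the root cause on their own.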

Cause

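  • A quiesce operation on a disk cannot complete while the disk still has pending I/O.
  • When I/O is stuck and the driver is unable to complete it, that I/O never completes.
  • If a quiesce operation is started (by transient error handling, APD handling, or any decommission (DECOM) operation) while I/O is stuck at the device layer, the quiesce never finishes, because the pending stuck I/O prevents cleanup from proceeding.
  • This results in a race condition.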
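
The dependency described above can be illustrated with a toy model. This is a minimal sketch and not vSAN code: the `Device` class, `driver_tick`, and `quiesce` are hypothetical names, and the model bounds the wait with `max_ticks` only so that the example terminates, whereas the behavior described in this article is a wait that never ends.

```python
class Device:
    """Toy device: tracks pending I/O and whether the driver can finish it."""

    def __init__(self, pending_ios):
        # Maps an I/O id to True (driver can complete it) or False (stuck).
        self.pending = dict(pending_ios)

    def driver_tick(self):
        """Complete every pending I/O that the driver is still able to finish."""
        self.pending = {io: ok for io, ok in self.pending.items() if not ok}


def quiesce(device, max_ticks):
    """Return the tick at which quiesce finished, or None if it never did.

    Quiesce can only finish once no I/O is pending on the device.
    """
    for tick in range(1, max_ticks + 1):
        device.driver_tick()
        if not device.pending:
            return tick
    return None


healthy = Device({"io-1": True, "io-2": True})
faulty = Device({"io-1": True, "io-2": False})  # io-2 is stuck in the driver

print(quiesce(healthy, max_ticks=1000))  # finishes on the first tick
print(quiesce(faulty, max_ticks=1000))   # None: the stuck I/O blocks cleanup
```

In the model, a single stuck I/O is enough to keep the quiesce from ever finishing, no matter how long it waits.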

 

Resolution

This issue is resolved in ESXi 7.0U3 P09 or later, which is included in VxRail code version 7.0.520.
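
A cluster's VxRail version string can be compared against the fixed release with a few lines of Python. This is a minimal sketch under stated assumptions: `has_fix` and `parse_version` are hypothetical helpers, the version string is assumed to look like `7.0.520` with an optional `-build` suffix, and the check deliberately answers only for the 7.0.x line, since this article does not state which releases outside 7.0.x carry the fix.

```python
FIXED_VXRAIL = (7, 0, 520)  # VxRail release named in this article


def parse_version(text):
    """Turn '7.0.520' or '7.0.520-12345' into a tuple of integers."""
    base = text.strip().split("-")[0]
    return tuple(int(part) for part in base.split(".")[:3])


def has_fix(vxrail_version):
    """Return True/False for 7.0.x versions, or None for any other line."""
    version = parse_version(vxrail_version)
    if version[:2] != FIXED_VXRAIL[:2]:
        return None  # outside 7.0.x: not covered by this sketch
    return version >= FIXED_VXRAIL
```

For example, `has_fix("7.0.483")` returns `False`, and `has_fix("7.0.520")` returns `True`.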

Workaround when a host is affected:

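  1. Unregister the virtual machines on the node.
  2. Place the host in maintenance mode.
  3. Reboot the host.
  4. If the disk reports a hardware failure, replace the NVMe disk.
  5. Evaluate an upgrade to the fixed release noted above for a permanent fix.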
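
Steps 1 to 3 can be written down as an ordered command plan before anything is run. The following is a minimal sketch in Python that only builds and returns command strings and executes nothing. The function `workaround_plan` is hypothetical; the ESXi shell commands it emits (`vim-cmd vmsvc/unregister`, `esxcli system maintenanceMode set`, `esxcli system shutdown reboot`) are standard ESXi commands, but this article does not say whether the steps are meant to be performed from the host shell or from vCenter, and the vSAN maintenance mode `ensureObjectAccessibility` is an assumption, not a recommendation from this article.

```python
def workaround_plan(vm_ids, vsan_mode="ensureObjectAccessibility"):
    """Build the ESXi shell commands for steps 1-3, in order, without running them.

    vm_ids: VM ids as listed by `vim-cmd vmsvc/getallvms` on the affected host.
    vsan_mode: assumed vSAN data-handling mode for maintenance mode.
    """
    # Step 1: unregister each virtual machine on the node.
    plan = [f"vim-cmd vmsvc/unregister {vm_id}" for vm_id in vm_ids]
    # Step 2: place the host in maintenance mode.
    plan.append(
        f"esxcli system maintenanceMode set --enable true --vsanmode {vsan_mode}"
    )
    # Step 3: reboot the host (esxcli requires a reason string).
    plan.append('esxcli system shutdown reboot --reason "Stuck NVMe I/O workaround"')
    return plan


for command in workaround_plan([12, 15]):
    print(command)
```

Review the printed plan against the cluster's current state before running anything, since a host with stuck I/O may not respond to these commands as expected.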

Affected Products

VxRail, VxRail Appliance Series, VxRail Software
Article Properties
Article Number: 000225946
Article Type: Solution
Last Modified: 11 Apr 2025
Version:  8