PowerEdge: Mellanox CX-5: Unrecoverable Hardware Error During POST

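Summary: Dell customers may report an unrecoverable hardware error during POST on configurations that combine specific Mellanox firmware, BIOS, and iDRAC versions. Although this issue was reported on VMware with CX-5 cards, other Dell/Mellanox hardware may also be affected.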


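Symptoms

Once this issue occurs, the device is left in a non-functional state, which may cause the ESXi host to crash (PSOD).
The problem signature may include, but is not limited to, the following boot console and vmkernel log messages:

When the system boots with an operating system and a CX-5 card installed, the console reports the following: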

2021-12-21T10:29:46.037Z cpu22:2098137)WARNING:  device's health compromised
2021-12-21T10:29:46.037Z cpu22:2098137) assertVar[0] 0x00000002
2021-12-21T10:29:46.037Z cpu22:2098137) assertVar[1] 0x000ddf7c
2021-12-21T10:29:46.037Z cpu22:2098137) assertVar[2] 0x00000000
2021-12-21T10:29:46.037Z cpu22:2098137) assertVar[3] 0x00000000
2021-12-21T10:29:46.037Z cpu22:2098137) assertVar[4] 0x00000000
2021-12-21T10:29:46.037Z cpu22:2098137) assertExitPtr 0x00804b98
2021-12-21T10:29:46.037Z cpu22:2098137) assertCallra 0x00804e40
2021-12-21T10:29:46.037Z cpu22:2098137) firmwareVersion 0x101a1770
2021-12-21T10:29:46.037Z cpu22:2098137) hwId 0x0000020d
2021-12-21T10:29:46.037Z cpu22:2098137) iriscIndex 5
2021-12-21T10:29:46.037Z cpu22:2098137) synd 0x8: unrecoverable hardware error
2021-12-21T10:29:46.037Z cpu22:2098137) extSynd 0x0087
2021-12-21T10:29:46.037Z cpu38:2098149)WARNING:  handling bad device here 
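
When triaging a dump like the one above, it can help to turn the `firmwareVersion` field into a dotted version string. The sketch below assumes the value packs the version as an 8-bit major, an 8-bit minor, and a 16-bit sub-minor field; this layout is an assumption and is not stated in this article. The function name `decode_fw_version` is hypothetical.

```shell
#!/bin/sh
# Sketch only: decode a packed firmwareVersion value from the health dump.
# Assumed layout (not confirmed by this article):
#   bits 31-24 = major, bits 23-16 = minor, bits 15-0 = sub-minor.
# decode_fw_version is a hypothetical helper name.
decode_fw_version() {
    raw=$(( $1 ))                      # shell arithmetic accepts 0x-prefixed hex
    major=$(( (raw >> 24) & 0xff ))
    minor=$(( (raw >> 16) & 0xff ))
    sub=$(( raw & 0xffff ))
    printf '%d.%d.%d\n' "$major" "$minor" "$sub"
}

# Value taken from the firmwareVersion line in the console output.
decode_fw_version 0x101a1770
```

Under the assumed layout, the sample value decodes to 16.26.6000.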


Additionally, the operating system log contains:

# grep "unrecoverable hardware error" /var/log/messages
2021-12-17T05:21:35.159473-06:00 <0.7> MN-R640-77-1(id1) /boot/kernel.amd64/kernel: mlx5_core: INFO: synd 0x8: unrecoverable hardware error

Cause

The BMC opens a channel over SMBus and, at some point, moves the sideband communication medium to PCIe VDM. The ConnectX-5 device does not handle this transition.

Resolution

Dell Engineering is aware of this issue. Do not replace hardware.

This behavior is resolved in Mellanox firmware 16.32.20.04. Have the customer update to this firmware version or later.
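
As a rough aid for checking whether a reported ConnectX-5 firmware string is at or above the fixed level, the sketch below compares a version against 16.32.20.04. It assumes that the Dell four-field form 16.32.20.04 corresponds to the three-field Mellanox form 16.32.2004; that mapping is an assumption, not something this article states. The helper names `normalize_fw` and `fw_is_fixed` are hypothetical, and 16.26.6000 is used only as an illustrative older version string.

```shell
#!/bin/sh
# Sketch only: compare a ConnectX-5 (16.x) firmware string with the fixed level.
# normalize_fw and fw_is_fixed are hypothetical helper names.

# Print the version as "major minor subminor".
# Assumption: a four-field Dell string such as 16.32.20.04 is the
# three-field string 16.32.2004 with the sub-minor split in two.
normalize_fw() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the version string on dots
    IFS=$old_ifs
    if [ $# -eq 4 ]; then
        echo "$1 $2 $3$4"
    else
        echo "$1 $2 $3"
    fi
}

# Return 0 when the version is 16.32.2004 or later, 1 otherwise.
# Only meaningful for ConnectX-5 firmware, whose major version is 16.
fw_is_fixed() {
    set -- $(normalize_fw "$1")
    [ "$1" -gt 16 ] && return 0
    [ "$1" -lt 16 ] && return 1
    [ "$2" -gt 32 ] && return 0
    [ "$2" -lt 32 ] && return 1
    [ "$3" -ge 2004 ]
}

for v in 16.26.6000 16.32.20.04; do
    if fw_is_fixed "$v"; then
        echo "$v: at or above the fixed level"
    else
        echo "$v: update required"
    fi
done
```

The comparison is done field by field with numeric tests rather than as a string comparison, so 16.32.2004 is correctly treated as newer than 16.32.1010 and older than 16.35.1012.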

Affected Products

Mellanox Family of Adapters, VMware ESXi 7.x

Products

PowerEdge XR2, PowerEdge C4140, PowerEdge C6420, PowerEdge C6520, PowerEdge C6525, PowerEdge MX5016s, PowerEdge MX740C, PowerEdge MX750c, PowerEdge MX840C, PowerEdge R350, PowerEdge R440, PowerEdge R450, PowerEdge R540, PowerEdge R550, PowerEdge R640, PowerEdge R6415, PowerEdge R650, PowerEdge R650xs, PowerEdge R6515, PowerEdge R6525, PowerEdge R740, PowerEdge R740XD, PowerEdge R740XD2, PowerEdge R7415, PowerEdge R7425, PowerEdge R750, PowerEdge R750XA, PowerEdge R750xs, PowerEdge R7515, PowerEdge R7525, PowerEdge R840, PowerEdge R940, PowerEdge R940xa, PowerEdge T350, PowerEdge T440, PowerEdge T550, PowerEdge T640, PowerEdge XE2420, PowerEdge XE7420, PowerEdge XE7440, PowerEdge XE8545, PowerEdge XR11, PowerEdge XR12, PowerFlex appliance R650, PowerFlex appliance R6525, PowerFlex appliance R750, VMware ESXi 8.x, PowerFlex appliance R640, PowerFlex appliance R740XD, PowerFlex appliance R7525, PowerFlex appliance R840 ...
Article Properties
Article Number: 000195461
Article Type: Solution
Last Modified: 21 Apr 2025
Version: 7