Dell VNX1 Series: SP Reboots Due to Bug Check Code 00000041 HEMI_CPU1_WATCHDOG
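Summary: A VNX1 storage processor (SP) may reboot with bug check 0x41 HEMI_CPU1_WATCHDOG when a broadcast storm on the management port starves Flare of CPU scheduling time. Upgrading to Flare OE 05.32.000.5.225 or later and correcting the network misconfiguration (for example, using STP to prevent loops) resolves this issue. (Dell-correctable)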
Symptoms
The SP reboots for the following reason: Bug check code: 00000041 HEMI_CPU1_WATCHDOG
DGSSP 76008106 The Storage Processor rebooted unexpectedly @ 22:55:01 on 08/05/2018: BugCheck 0, {0000000000000000, 0000000000000041, 0000000000000a20, fffff880038bbfc0}, Failing Instruction: 0xfffff8803d89dfa7 in flaredrv.sys loaded @ 0xfffff8803d627000 76008106 [HEMI_CPU1_WATCHDOG <Flare>]
[ BugcheckCode: 41 Definition: HEMI_CPU1_WATCHDOG ]
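The fields in the reboot event can be pulled apart mechanically when triaging many arrays. The following is a minimal sketch, assuming the event text always follows the layout shown above; the function name `parse_reboot_event` and the regular expression are illustrative, not part of any Dell tool.

```python
import re

# Sample event text copied from the symptom above.
EVENT = (
    "The Storage Processor rebooted unexpectedly @ 22:55:01 on 08/05/2018: "
    "BugCheck 0, {0000000000000000, 0000000000000041, 0000000000000a20, "
    "fffff880038bbfc0}, Failing Instruction: 0xfffff8803d89dfa7 in flaredrv.sys "
    "loaded @ 0xfffff8803d627000 76008106 [HEMI_CPU1_WATCHDOG <Flare>]"
)

# Hypothetical pattern: assumes the field order seen in the sample event.
PATTERN = re.compile(
    r"BugCheck \w+, \{(?P<params>[^}]*)\}, "
    r"Failing Instruction: 0x(?P<ip>[0-9a-fA-F]+) in (?P<module>\S+) "
    r"loaded @ 0x(?P<base>[0-9a-fA-F]+).*\[(?P<name>\w+)"
)


def parse_reboot_event(text):
    """Return the bug check parameters, module, and offset, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    params = [int(p.strip(), 16) for p in match.group("params").split(",")]
    instruction = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return {
        "params": params,                # four hexadecimal parameters
        "module": match.group("module"),
        "offset": instruction - base,    # failing instruction relative to load address
        "name": match.group("name"),
    }


if __name__ == "__main__":
    info = parse_reboot_event(EVENT)
    print(info["name"], hex(info["params"][1]), info["params"][2], hex(info["offset"]))
```

In this sample the second parameter is the bug check code (0x41) and the third (0xa20) equals decimal 2592, which matches the source line number reported in the dump log later in this article.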
Cause
The logs show that the SP rebooted because of the following bug check and that it could not communicate with the peer SP.
Bugcheck: 00000041 HEMI_CPU1_WATCHDOG
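The KTlogs show CMI communication problems.
A storm of broadcast traffic was sent to the management port. This caused the Ethernet driver's thread to run at a high rate and starve other threads of CPU time, which eventually led to the panic.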
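The starvation mechanism can be illustrated with a toy model. The sketch below is not Flare code; it assumes a watchdog that counts consecutive ticks in which a monitored thread did not run and fires once the count reaches a limit. The 256-tick limit is taken from the FLARE_CPU_WATCHDOG line in the dump log in this article; the thread names `flare` and `eth` and the function `ticks_until_panic` are hypothetical.

```python
WATCHDOG_LIMIT_TICKS = 256  # limit quoted in the FLARE_CPU_WATCHDOG log line


def ticks_until_panic(schedule, monitored="flare", limit=WATCHDOG_LIMIT_TICKS):
    """Return the 1-based tick at which the toy watchdog fires, or None.

    `schedule` is an iterable naming which thread ran on each tick.
    Assumption: the watchdog fires when the monitored thread has been
    absent for `limit` consecutive ticks.
    """
    starved = 0
    for tick, running in enumerate(schedule, start=1):
        if running == monitored:
            starved = 0          # monitored thread ran, reset the counter
        else:
            starved += 1
            if starved >= limit:
                return tick
    return None


if __name__ == "__main__":
    healthy = ["flare", "eth"] * 300           # threads alternate
    storm = ["flare"] * 10 + ["eth"] * 300     # driver thread monopolizes the CPU
    print(ticks_until_panic(healthy))
    print(ticks_until_panic(storm))
```

With an alternating schedule the counter never approaches the limit, while a schedule dominated by the driver thread trips it, which is the behavior the cause description attributes to the broadcast storm.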
Resolution
This issue is fixed in Flare OE MR1 SP6 [05.32.000.5.225].
Upgrade the array to Flare OE 05.32.000.5.225 or later.
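To check many arrays against the fixed revision, the dotted revision strings can be compared numerically rather than as text. This is a minimal sketch, assuming every revision has the same five-part dotted form and belongs to the 05.32 release family; the helper names `parse_revision` and `needs_upgrade` are hypothetical.

```python
FIXED_REVISION = "05.32.000.5.225"  # fixed revision named in this article


def parse_revision(revision):
    """Split a dotted revision such as '05.32.000.5.225' into a tuple of integers."""
    return tuple(int(part) for part in revision.strip().split("."))


def needs_upgrade(current, fixed=FIXED_REVISION):
    """Return True when `current` is older than `fixed`.

    Assumption: both revisions are from the same release family, so a
    field-by-field numeric comparison is meaningful.
    """
    return parse_revision(current) < parse_revision(fixed)


if __name__ == "__main__":
    for revision in ("05.32.000.5.221", "05.32.000.5.225"):
        print(revision, "needs upgrade:", needs_upgrade(revision))
```

Revision 05.32.000.5.221, the one running on both SPs in the TRiiAGE output in this article, compares as older than the fixed revision.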
Users should also investigate the network for misconfigurations that could cause a broadcast storm.
Additional Information
From the TRiiAGE_Analysis file (SP collect):
                            SPA------------        SPB------------
Array Software Revision:    05.32.000.5.221        05.32.000.5.221
SP Time:                    08/06/2018 02:49:58    08/06/2018 02:49:54
SP Uptime:                  260 days 01:00:20      03:51:37
Read Cache State:           ENABLED                ENABLED
Write Cache State:          ENABLED                ENABLED
Greater WC Availability:    ENABLED                ENABLED
System Fault LED:           OFF                    OFF
SPB has encountered Bugcheck code 00000041 on 08/05/2018 22:58:55.
Bugcheck Name and Definition
*****************************
HEMI_CPU1_WATCHDOG - FLARE could not reschedule its thread as the all CPU time is used up by a particular process not allowing FLARE to reschedule its process.
Recommendation
**************
RCA: The panic might happen on a visibly idle system if it experiences network packet storm on Management port - see AR 648546. This happens if there is a misbehaving host flooding network with its packets. Typically this is caused by having an ethernet loop on the network, such that any broadcast packets on the network end up circulating without end. Dual panic is likely. Partial fix, which ensures recommended registry setting, is in R32.215 tracked by 614657. Dup to AR 667187, which is tracking the fix in R32 MR1 SP6. Workaround: enable Spanning Tree Protocol for preventing this condition. The panic can also happen due to a bug in monitoring software, which leaks handles, exposed when there are drive faults. Fixed in R32.221 tracked by AR 734114 and in R33.96 tracked by AR 681805.
B 08/05/18 23:09:20 DGSSP 76008106 The Storage Processor rebooted unexpectedly @ 22:55:01 on 08/05/2018: BugCheck 0, {0000000000000000, 0000000000000041, 0000000000000a20, fffff880038bbfc0}, Failing Instruction: 0xfffff8803d89dfa7 in flaredrv.sys loaded @ 0xfffff8803d627000 76008106 [HEMI_CPU1_WATCHDOG <Flare>]
[ BugcheckCode: 41 Definition: HEMI_CPU1_WATCHDOG ]
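The SP Time and SP Uptime values in the TRiiAGE output above can be combined to estimate when each SP last booted, which helps confirm which SP rebooted. This is a minimal sketch, assuming dates are in MM/DD/YYYY form and uptime is printed either as `HH:MM:SS` or as `N days HH:MM:SS`; the function names `parse_uptime` and `boot_time` are hypothetical.

```python
import re
from datetime import datetime, timedelta


def parse_uptime(text):
    """Convert '03:51:37' or '260 days 01:00:20' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+) days )?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognized uptime: {text!r}")
    days = int(match.group(1) or 0)
    return timedelta(
        days=days,
        hours=int(match.group(2)),
        minutes=int(match.group(3)),
        seconds=int(match.group(4)),
    )


def boot_time(sp_time, uptime):
    """Estimate the boot time as SP time minus uptime (assumes MM/DD/YYYY)."""
    now = datetime.strptime(sp_time, "%m/%d/%Y %H:%M:%S")
    return now - parse_uptime(uptime)


if __name__ == "__main__":
    print("SPA booted:", boot_time("08/06/2018 02:49:58", "260 days 01:00:20"))
    print("SPB booted:", boot_time("08/06/2018 02:49:54", "03:51:37"))
```

For SPB the estimate is 22:58:17 on 08/05/2018, within about a minute of the bug check time that TRiiAGE reports for SPB, while SPA had been up for 260 days.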
Example from the dump log:
FAILURE_BUCKET_ID: X64_0x0_flaredrv!hemi_panic+bb
BUCKET_ID: X64_0x0_flaredrv!hemi_panic+bb
22:29:01.616 0 FFFFF880009D5FC0 TCD1: Initiator 5000144290571910 LUN 29 Tag 0339 aborted.
22:29:01.617 165 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x0619
22:29:01.617 13 FFFFF8800384AFC0 TCD2: Initiator 5000144290571911 LUN 24 Tag 0619 aborted.
22:29:01.617 12 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x012a
22:29:01.617 7 FFFFF8800384AFC0 TCD2: Initiator 5000144290571911 LUN 24 Tag 012A aborted.
22:29:01.617 7 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x0107
22:29:01.617 9 FFFFF8800384AFC0 TCD2: Initiator 5000144290571911 LUN 15 Tag 0107 aborted.
22:29:02.252 635467 FFFFFA80074BE7A0 MIR: MirrorReadComplete() (IRP=FFFFFA80501DF010) Retry subrequest 0xFFFFFA804FFC6990
22:29:02.252 23 FFFFFA80074BE7A0 INFO LIB FAPI 100C0 : fbe_api_common_send_io_packet:Object went away while IO was on it's way
22:29:02.252 3 FFFFFA80074BE7A0 PPFD: ppfdPhysicalPackageSendRead fbe_api_common_send_io_packet failed fbeStatus= FBE_STATUS_EDGE_NOT_ENABLED
22:29:02.356 103752 FFFFF880009D5FC0 FCDMTL 0 (FE0/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x01a6
22:29:02.356 25 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x00b3
22:29:02.356 7 FFFFF880009D5FC0 TCD1: Initiator 5000144280571910 LUN 1C Tag 01A6 aborted.
22:29:02.356 16 FFFFF880009D5FC0 FCDMTL 0 (FE0/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x809e
22:29:02.356 9 FFFFF880009D5FC0 TCD1: Initiator 5000144280571910 LUN 1A Tag 809E aborted.
22:29:02.356 1 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 LUN 21 Tag 00B3 aborted.
22:29:02.356 30 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x0446
22:29:02.356 12 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 LUN 11 Tag 0446 aborted.
22:29:02.356 5 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x03d3
22:29:02.356 7 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 LUN 10 Tag 03D3 aborted.
22:29:02.518 161696 FFFFF880038BBFC0 FLARE_CPU_WATCHDOG: Flare not rescheduled (3135202894) in 256 ticks, hproc 0xFFFFF8803DCCA2A0
22:29:02.518 12 FFFFF880038BBFC0 *** PANIC HEMI_CPU1_WATCHDOG: 0x00000041, __LINE__: 0x0000000000000A20
22:29:02.518 21 FFFFF880038BBFC0 catmerge/disk/flare/hemi/hemi_process.c:2592
22:29:02.518 8 FFFFF880038BBFC0 Cmipci0: Notify peer that we are going down NOW. We can't wait!! (0x1)
22:29:02.518 4 FFFFF880038BBFC0 *** PANIC 0x00000041, 0x0000000000000A20 (2592)
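When scanning dump log text like the excerpt above, the watchdog entry can be located by splitting the text on the entry header and searching the messages. This is a minimal sketch, assuming every entry starts with a timestamp, a decimal number, and a 16-digit hexadecimal thread address, as in the excerpt; the meaning of the decimal field is not stated in this article, so the code only stores it. The names `split_entries` and `find_watchdog` are hypothetical.

```python
import re

# A short sample taken from the dump log excerpt above.
SAMPLE = (
    "22:29:02.356 7 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 "
    "LUN 10 Tag 03D3 aborted. "
    "22:29:02.518 161696 FFFFF880038BBFC0 FLARE_CPU_WATCHDOG: Flare not "
    "rescheduled (3135202894) in 256 ticks, hproc 0xFFFFF8803DCCA2A0 "
    "22:29:02.518 12 FFFFF880038BBFC0 *** PANIC HEMI_CPU1_WATCHDOG: "
    "0x00000041, __LINE__: 0x0000000000000A20"
)

# Assumed entry header: timestamp, decimal field, thread address.
ENTRY = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) (\d+) ([0-9A-F]{16}) ")


def split_entries(text):
    """Split log text into entries, whether or not it contains line breaks."""
    matches = list(ENTRY.finditer(text))
    entries = []
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        entries.append({
            "time": match.group(1),
            "field2": int(match.group(2)),   # stored as-is; meaning not assumed
            "thread": match.group(3),
            "message": text[match.end():end].strip(),
        })
    return entries


def find_watchdog(entries):
    """Return (time, ticks) for the first watchdog entry, or None."""
    for entry in entries:
        hit = re.search(r"Flare not rescheduled \((\d+)\) in (\d+) ticks", entry["message"])
        if hit:
            return entry["time"], int(hit.group(2))
    return None


if __name__ == "__main__":
    print(find_watchdog(split_entries(SAMPLE)))
```

Because the splitter keys on the entry header rather than on newlines, it handles log text that was pasted as a single run as well as text with one entry per line.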
Example from the KTlogs:
22:30:07.311 14843432 FFFFF88003ABBFC0 3 peer:*** PANIC HEMI_CPU1_WATCHDOG: 0x00000041, __LINE__: 0x0000000000000A20
22:30:08.074 762941 FFFFFA8008427730 3 std:PciHal: Peer CMI ping INACTIVE 44564 44564 1
22:30:08.184 109366 FFFFF80001A55CC0 0 std:Hbd: Current PeerState: 1 NewState: 0
22:30:08.184 4 FFFFF80001A55CC0 0 std:Hbd: new peer state Hbd_State_Not_Running.
22:30:08.184 1 FFFFF80001A55CC0 0 std:CMID received Hbd_Event_Peer_Not_Running
22:30:08.184 9 FFFFF80001A55CC0 0 std:CMID HBD found local peer link 0... sending ping
22:30:08.184 5 FFFFF80001A55CC0 0 std:Cmipci0: Peer paniced, assume ping failure.
22:30:08.184 5 FFFFF80001A55CC0 0 std:PciHal: Client Ping Request 2
22:30:08.184 5 FFFFFA8008D3E040 0 std:CMID Bundle Failed 9c6be0be60010650:1 Quiescing => Quiescing
22:30:08.184 5 FFFFFA8008D3E040 0 std:CMID CmiPeerChannel::MaybeTakeQuiecingSpOfflineReleaseLock,
22:30:08.184 4 FFFFFA8008D3E040 0 std:Hbd: assert Hbd_State_Running_No_Cmi.
22:30:08.184 0 FFFFFA8008D3E040 0 std:CMID CmiPeerChannel::MaybeTakeQuiecingSpOfflineReleaseLock Quiescing
22:30:08.184 7 FFFFFA8008D3E040 0 std:MPS:Lost contact on conduit 0x3 with 9c6be0be60010650:1
22:30:08.184 10 FFFFFA8008D3E040 0 std:CPMCTRL:Callback(): Event Sp Contact Lost!
22:30:08.184 4 FFFFFA8008D3E040 0 std:TMOS_Supervisor::MpsCallback() Event: CONTACT_LOST
Affected Products
VNX1 Series
Products
VNX1 Series, VNX5100, VNX5150, VNX5300, VNX5500, VNX5700, VNX7500
Article Properties
Article Number: 000054971
Article Type: Solution
Last Modified: 30 Apr 2026
Version: 4