Dell VNX1 Series: SP Reboots with Bug Check Code 00000041 HEMI_CPU1_WATCHDOG

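Summary: A VNX1 storage processor (SP) may reboot when a broadcast storm on the management port starves Flare of CPU scheduling time, which triggers bug check 0x41 HEMI_CPU1_WATCHDOG. Upgrading to Flare OE 05.32.000.5.225 or later and correcting the network misconfiguration (for example, enabling STP to prevent loops) resolves the issue. (Dell correctable)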


Symptoms

The SP rebooted for the following reason: Bug check code: 00000041 HEMI_CPU1_WATCHDOG

DGSSP 76008106 The Storage Processor rebooted unexpectedly @ 22:55:01 on 08/05/2018: BugCheck 0, {0000000000000000, 0000000000000041, 0000000000000a20, fffff880038bbfc0}, Failing Instruction: 0xfffff8803d89dfa7 in flaredrv.sys loaded @ 0xfffff8803d627000 76008106 [HEMI_CPU1_WATCHDOG <Flare>]
[ BugcheckCode: 41 Definition: HEMI_CPU1_WATCHDOG ]
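
The fields in an event line like the one above can be pulled out with a short script. This is a minimal sketch, not a Dell tool: the function name `parse_reboot_event` and the regular expression are assumptions inferred from the single event line quoted in this article, and other firmware revisions may format the message differently. It reads the second bug check argument as the bug check code and the third as the source line number, which agrees with the trace excerpt later in this article (0xA20 = 2592).

```python
import re

# Hypothetical pattern, inferred only from the event line quoted in this article.
EVENT_RE = re.compile(
    r"rebooted unexpectedly @ (?P<time>\d{2}:\d{2}:\d{2}) on (?P<date>\d{2}/\d{2}/\d{4}): "
    r"BugCheck \w+, \{(?P<args>[^}]*)\}.*?in (?P<module>\S+) loaded"
)


def parse_reboot_event(line):
    """Return the reboot details found in one event line, or None if it does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    # The four bug check arguments are hexadecimal values without a 0x prefix.
    args = [int(arg.strip(), 16) for arg in match.group("args").split(",")]
    if len(args) < 3:
        return None
    return {
        "time": match.group("time"),
        "date": match.group("date"),
        "bugcheck_code": args[1],   # 0x41 for HEMI_CPU1_WATCHDOG
        "source_line": args[2],     # 0xA20 = 2592 in the example
        "module": match.group("module"),
    }
```

Calling `parse_reboot_event()` on each line of an exported event log and keeping the results where `bugcheck_code == 0x41` gives a quick list of the reboots this article covers.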

 

Cause

The logs show that the SP rebooted for the following reason and that it could not communicate with the peer SP.

Bugcheck: 00000041 HEMI_CPU1_WATCHDOG

The KTlogs show CMI communication problems.

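A storm of broadcast traffic was sent to the management port. This caused the Ethernet driver's thread to run at a high rate and starve other threads of CPU time, which eventually led to the panic.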

 

Resolution

This issue is fixed in Flare OE MR1 SP6 [05.32.000.5.225].

Upgrade the array to Flare OE 05.32.000.5.225 or later.

Users should also investigate the network for misconfigurations that could cause a network storm.
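
To check whether an array is still below the fixed revision, the revision strings can be compared field by field. The sketch below is illustrative only: the names `FIXED_REVISION`, `revision_key`, and `needs_upgrade` are hypothetical, and it assumes revisions are dot-separated integers and that both values belong to the same 05.32 release family.

```python
# Fixed revision named in this article (Flare OE MR1 SP6).
FIXED_REVISION = "05.32.000.5.225"


def revision_key(revision):
    """Turn a revision such as '05.32.000.5.221' into a tuple of integers."""
    return tuple(int(part) for part in revision.split("."))


def needs_upgrade(revision, fixed=FIXED_REVISION):
    """Return True if the revision is older than the fixed revision.

    Assumption: both revisions are in the same release family (05.32),
    so a plain field-by-field comparison is meaningful.
    """
    return revision_key(revision) < revision_key(fixed)
```

For example, `needs_upgrade("05.32.000.5.221")` returns `True` for the revision reported in the TRiiAGE output in this article, while `needs_upgrade("05.32.000.5.225")` returns `False`.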

 

Additional Information

From the TRiiAGE_Analysis file (from the SP collects):

                          SPA------------        SPB------------
Array Software Revision:  05.32.000.5.221        05.32.000.5.221
SP Time:                  08/06/2018 02:49:58    08/06/2018 02:49:54
SP Uptime:                260 days 01:00:20      03:51:37
Read Cache State:         ENABLED                ENABLED
Write Cache State:        ENABLED                ENABLED
Greater WC Availability:  ENABLED                ENABLED
System Fault LED:         OFF                    OFF

SPB has encountered Bugcheck code 00000041 on 08/05/2018 22:58:55.
Bugcheck Name and Definition
*****************************
HEMI_CPU1_WATCHDOG - FLARE could not reschedule its thread as the all CPU time is used up by a particular process not allowing FLARE to reschedule its process.

Recommendation
**************
RCA: The panic might happen on a visibly idle system if it experiences network packet storm on Management port - see AR 648546. This happens if there is a misbehaving host flooding network with its packets. Typically this is caused by having an ethernet loop on the network, such that any broadcast packets on the network end up circulating without end. Dual panic is likely. Partial fix, which ensures recommended registry setting, is in R32.215 tracked by 614657. Dup to AR 667187, which is tracking the fix in R32 MR1 SP6. Workaround: enable Spanning Tree Protocol for preventing this condition. The panic can also happen due to a bug in monitoring software, which leaks handles, exposed when there are drive faults. Fixed in R32.221 tracked by AR 734114 and in R33.96 tracked by AR 681805.


B 08/05/18 23:09:20 DGSSP 76008106 The Storage Processor rebooted unexpectedly @ 22:55:01 on 08/05/2018: BugCheck 0, {0000000000000000, 0000000000000041, 0000000000000a20, fffff880038bbfc0}, Failing Instruction: 0xfffff8803d89dfa7 in flaredrv.sys loaded @ 0xfffff8803d627000 76008106 [HEMI_CPU1_WATCHDOG <Flare>]
[ BugcheckCode: 41 Definition: HEMI_CPU1_WATCHDOG ]

Example from the dump log:

FAILURE_BUCKET_ID: X64_0x0_flaredrv!hemi_panic+bb

BUCKET_ID: X64_0x0_flaredrv!hemi_panic+bb

22:29:01.616 0 FFFFF880009D5FC0 TCD1: Initiator 5000144290571910 LUN 29 Tag 0339 aborted.
22:29:01.617 165 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x0619
22:29:01.617 13 FFFFF8800384AFC0 TCD2: Initiator 5000144290571911 LUN 24 Tag 0619 aborted.
22:29:01.617 12 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x012a
22:29:01.617 7 FFFFF8800384AFC0 TCD2: Initiator 5000144290571911 LUN 24 Tag 012A aborted.
22:29:01.617 7 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x0107
22:29:01.617 9 FFFFF8800384AFC0 TCD2: Initiator 5000144290571911 LUN 15 Tag 0107 aborted.
22:29:02.252 635467 FFFFFA80074BE7A0 MIR: MirrorReadComplete() (IRP=FFFFFA80501DF010) Retry subrequest 0xFFFFFA804FFC6990
22:29:02.252 23 FFFFFA80074BE7A0 INFO LIB FAPI 100C0 : fbe_api_common_send_io_packet:Object went away while IO was on it's way
22:29:02.252 3 FFFFFA80074BE7A0 PPFD: ppfdPhysicalPackageSendRead fbe_api_common_send_io_packet failed fbeStatus= FBE_STATUS_EDGE_NOT_ENABLED
22:29:02.356 103752 FFFFF880009D5FC0 FCDMTL 0 (FE0/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x01a6
22:29:02.356 25 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x00b3
22:29:02.356 7 FFFFF880009D5FC0 TCD1: Initiator 5000144280571910 LUN 1C Tag 01A6 aborted.
22:29:02.356 16 FFFFF880009D5FC0 FCDMTL 0 (FE0/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x809e
22:29:02.356 9 FFFFF880009D5FC0 TCD1: Initiator 5000144280571910 LUN 1A Tag 809E aborted.
22:29:02.356 1 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 LUN 21 Tag 00B3 aborted.
22:29:02.356 30 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x0446
22:29:02.356 12 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 LUN 11 Tag 0446 aborted.
22:29:02.356 5 FFFFF8800384AFC0 FCDMTL 1 (FE1/SC) Abort received of type ABORT_TASK / ABTS for reason ABTS_RECD of exchange 0x03d3
22:29:02.356 7 FFFFF8800384AFC0 TCD2: Initiator 5000144280571911 LUN 10 Tag 03D3 aborted.
22:29:02.518 161696 FFFFF880038BBFC0 FLARE_CPU_WATCHDOG: Flare not rescheduled (3135202894) in 256 ticks, hproc 0xFFFFF8803DCCA2A0
22:29:02.518 12 FFFFF880038BBFC0 *** PANIC HEMI_CPU1_WATCHDOG: 0x00000041, __LINE__: 0x0000000000000A20
22:29:02.518 21 FFFFF880038BBFC0 catmerge/disk/flare/hemi/hemi_process.c:2592
22:29:02.518 8 FFFFF880038BBFC0 Cmipci0: Notify peer that we are going down NOW. We can't wait!! (0x1)
22:29:02.518 4 FFFFF880038BBFC0 *** PANIC 0x00000041, 0x0000000000000A20 (2592)
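
The watchdog message and the panic that follows it can be located in a long trace with a small scan. This is a rough sketch under stated assumptions: the function name `find_watchdog_panic` and both patterns are hypothetical and were written only against the trace lines quoted above, so they may need adjusting for other trace formats.

```python
import re

# Hypothetical patterns, based only on the trace lines quoted in this article.
WATCHDOG_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3})\s+\d+\s+(?P<thread>[0-9A-Fa-f]{16})\s+"
    r"FLARE_CPU_WATCHDOG: Flare not rescheduled \(\d+\) in (?P<ticks>\d+) ticks"
)
PANIC_RE = re.compile(r"\*\*\* PANIC (?P<name>\w+): 0x(?P<code>[0-9A-Fa-f]+)")


def find_watchdog_panic(lines):
    """Return details of the first watchdog message that is followed by a named panic."""
    watchdog = None
    for line in lines:
        hit = WATCHDOG_RE.search(line)
        if hit:
            # Remember the most recent watchdog message and keep scanning.
            watchdog = {
                "timestamp": hit.group("ts"),
                "thread": hit.group("thread"),
                "ticks": int(hit.group("ticks")),
            }
            continue
        panic = PANIC_RE.search(line)
        if panic and watchdog is not None:
            watchdog["panic_name"] = panic.group("name")
            watchdog["panic_code"] = int(panic.group("code"), 16)
            return watchdog
    return None
```

Passing the lines of a trace file to `find_watchdog_panic()` returns `None` when no watchdog panic is present, which makes it easy to screen several SP collects at once.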

Example from the KTlogs:

22:30:07.311 14843432 FFFFF88003ABBFC0 3 peer:*** PANIC HEMI_CPU1_WATCHDOG: 0x00000041, __LINE__: 0x0000000000000A20
22:30:08.074 762941 FFFFFA8008427730 3 std:PciHal: Peer CMI ping INACTIVE 44564 44564 1
22:30:08.184 109366 FFFFF80001A55CC0 0 std:Hbd: Current PeerState: 1 NewState: 0
22:30:08.184 4 FFFFF80001A55CC0 0 std:Hbd: new peer state Hbd_State_Not_Running.
22:30:08.184 1 FFFFF80001A55CC0 0 std:CMID received Hbd_Event_Peer_Not_Running
22:30:08.184 9 FFFFF80001A55CC0 0 std:CMID HBD found local peer link 0... sending ping
22:30:08.184 5 FFFFF80001A55CC0 0 std:Cmipci0: Peer paniced, assume ping failure.
22:30:08.184 5 FFFFF80001A55CC0 0 std:PciHal: Client Ping Request 2

22:30:08.184 5 FFFFFA8008D3E040 0 std:CMID Bundle Failed 9c6be0be60010650:1 Quiescing => Quiescing
22:30:08.184 5 FFFFFA8008D3E040 0 std:CMID CmiPeerChannel::MaybeTakeQuiecingSpOfflineReleaseLock,
22:30:08.184 4 FFFFFA8008D3E040 0 std:Hbd: assert Hbd_State_Running_No_Cmi.
22:30:08.184 0 FFFFFA8008D3E040 0 std:CMID CmiPeerChannel::MaybeTakeQuiecingSpOfflineReleaseLock Quiescing
22:30:08.184 7 FFFFFA8008D3E040 0 std:MPS:Lost contact on conduit 0x3 with 9c6be0be60010650:1
22:30:08.184 10 FFFFFA8008D3E040 0 std:CPMCTRL:Callback(): Event Sp Contact Lost!
22:30:08.184 4 FFFFFA8008D3E040 0 std:TMOS_Supervisor::MpsCallback() Event: CONTACT_LOST

 

Affected Products

VNX1 Series

Products

VNX1 Series, VNX5100, VNX5150, VNX5300, VNX5500, VNX5700, VNX7500
Article Properties
Article Number: 000054971
Article Type: Solution
Last Modified: 30 Apr 2026
Version: 4