Dell Unity: Both SPA and SPB panic at same time due to write request pool size
Summary: Both SPA and SPB panic at the same time due to the write request pool size
Symptoms
SPA Panicked two times:
Fri Mar 19 06:38:35 UTC 2021 system-state: set sp-critical-error
SPB Panicked three times:
Fri Mar 19 06:39:35 UTC 2021 system-state: set sp-critical-error
Fri Mar 19 06:51:05 UTC 2021 system-state: set sp-critical-error
/spa/EMC/C4Core/log> zgrep -A1 0x81254002 c4_safe_ktrace*
c4_safe_ktrace.log.10.gz:2021/03/19-06:50:11.935709 11 7FA30677870B std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.10.gz:2021/03/19-06:50:11.935710 ~~~~ 7FA30677870B std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
c4_safe_ktrace.log.15.gz:2021/03/19-06:38:29.041315 ~~~~ 7FA9A02CF709 std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.15.gz:2021/03/19-06:38:29.041316 ~~~~ 7FA9A0FF4701 std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
c4_safe_ktrace.log.23.gz:2021/03/19-05:57:33.014928 101K 7F4A6A827706 std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.23.gz:2021/03/19-05:57:33.014932 0 7F4A6A827706 std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
c4_safe_ktrace.log.7.gz:2021/03/19-06:59:52.097333 272 7FC21798570C std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.7.gz:2021/03/19-06:59:52.097334 ~~~~ 7FC21798570C std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
/spb/EMC/C4Core/log> zgrep -A1 0x81254002 c4_safe_ktrace*
c4_safe_ktrace.log.11.gz:2021/03/19-06:50:59.806541 3880 7FBDE30B570E std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.11.gz:2021/03/19-06:50:59.806545 ~~~~ 7FBDE30B570E std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
c4_safe_ktrace.log.21.gz:2021/03/19-06:39:29.781207 13 7F1582EEF70D std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.21.gz:2021/03/19-06:39:29.781208 ~~~~ 7F1582EEF70D std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
c4_safe_ktrace.log.24.gz:2021/03/19-06:10:44.513508 1105 7FDFD279970C std:RMD: Allocating write rqst failed! Status is 0x81254002
c4_safe_ktrace.log.24.gz:2021/03/19-06:10:44.513511 ~~~~ 7FDFD279970C std:Cmipci1: Notify peer that we are going down NOW. We can't wait!! (0x1)
SPA:
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 139957593749248 aka 139957593749248) [PID:30127 TID:24677 CORE:11 [csx_ic_std.x] [asyncFlush185] [03/19/2021 05:57:30 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 140366523385600 aka 140366523385600) [PID:29836 TID:31282 CORE:7 [csx_ic_std.x] [asyncFlush116] [03/19/2021 06:38:30 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 140338173945600 aka 140338173945600) [PID:29848 TID:7080 CORE:3 [csx_ic_std.x] [asyncFlush46] [03/19/2021 06:50:11 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 140471598331648 aka 140471598331648) [PID:29945 TID:25809 CORE:7 [csx_ic_std.x] [asyncFlush100] [03/19/2021 06:59:52 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
SPB:
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 140599284365056 aka 140599284365056) [PID:29864 TID:25385 CORE:13 [csx_ic_std.x] [asyncFlush214] [03/19/2021 06:09:18 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 139730383755008 aka 139730383755008) [PID:30143 TID:25017 CORE:8 [csx_ic_std.x] [asyncFlush65] [03/19/2021 06:39:31 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
c4_safe_native.log:CSX RT: panic requested at: KLogBugCheck.c:57 (thread: 140453508536064 aka 140453508536064) [PID:29844 TID:27770 CORE:9 [csx_ic_std.x] [asyncFlush202] [03/19/2021 06:51:01 UTC]] (panic action:DEFAULT expr:<no-expr> flags:-) [info:0]
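To compare panic times across the two SPs, the timestamps can be pulled out of the c4_safe_native.log entries shown above. The following is a minimal sketch, assuming GNU grep and the log line format seen in this article; panic_times is a hypothetical helper name, not a Unity tool.

```shell
# Hypothetical helper (not part of Unity OE): reads c4_safe_native.log lines
# on stdin and prints the UTC timestamp of each "panic requested" entry.
# Assumes the bracketed "MM/DD/YYYY HH:MM:SS UTC" format shown in this article.
panic_times() {
    grep 'panic requested' |
        grep -oE '[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC'
}

# Example use on an SP (file name as it appears in the log excerpts):
#   panic_times < c4_safe_native.log
```

Running it against the log on each SP gives one timestamp per panic, which makes it easy to see whether the SPs went down within the same window.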
Cause
Each panic follows an RMD write request allocation failure (status 0x81254002 in the ktrace logs above), which points to the write request global pool being too small.
Resolution
Workaround:
Increase the write request global pool size to avoid the panic. A larger pool uses additional memory on each SP.
To increase the pool size to 32768 (an example; other values can be used), perform the following steps on both SPs:
- Check the current value:
reg_tool get /SYSTEM/CurrentControlSet/Services/RemoteMirroring/Parameters/WriteRequestPoolSize
- Set the value to 32768 (0x00008000), as an example:
reg_tool set /SYSTEM/CurrentControlSet/Services/RemoteMirroring/Parameters/WriteRequestPoolSize=REG_DWORD@0x00008000
- Reboot both SPs, one at a time.
(Note that an NDU resets the parameter to its default value. Set the parameter again after each NDU until the OE is upgraded to a release that includes the fix.)
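The value passed to reg_tool is the pool size written as an 8-digit hexadecimal DWORD, so 32768 becomes 0x00008000. For other pool sizes, the conversion can be sketched as follows. The function names are hypothetical helpers for illustration; the script only prints the command and does not run reg_tool.

```shell
# Hypothetical helper: format a decimal pool size as the REG_DWORD value
# in the form used by the reg_tool set command in this article.
pool_size_to_dword() {
    printf 'REG_DWORD@0x%08X\n' "$1"
}

# Print (do not execute) the reg_tool set command for a given pool size.
print_set_command() {
    key=/SYSTEM/CurrentControlSet/Services/RemoteMirroring/Parameters/WriteRequestPoolSize
    printf 'reg_tool set %s=%s\n' "$key" "$(pool_size_to_dword "$1")"
}

print_set_command 32768
```

For 32768 this prints the same command as in the step above. Review the printed command before running it on each SP.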
A fix will be available in the next major Unity OE code release. EE plans to increase the RMD write request pool size in the code to avoid the panic.