JGWisniewski

Avamar node crash

Yesterday our single-node Avamar Gen 4 appliance crashed with the following error: "HARDWARE: Aug 18 05:41:20 CPSAV001 hwfaultd: Expander Unrecoverable Error". The system dialed home via ESRS, but when Support tried to connect it was not accessible. They suggested we dispatch a CE to power drain the system. This morning, while I was checking the system before the CE arrived, I plugged a console cable into it and noticed that the system was trying to do a memory dump. As soon as I plugged the console cable in, it started writing. I started logging the PuTTY session so that I had a record of what was happening. As soon as the dump was done, the system rebooted and came up, but there was corruption. I reached back out to Support, and shortly afterwards the system was back up and running after recovering the last validated checkpoint.

 

Three hours later, I logged into the system to look for a deleted file and noticed that MCS wouldn't connect; it just spun. I tried SSH next: it connected, but nothing happened. I plugged the console cable back in and saw the following:

[21007.642818] Call Trace:

[21007.643337] Code: c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 <0f> b7 17 eb f5 c3 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c6 <

[21007.666764] Stack:

[21007.666775] Call Trace:

[21007.666811] Code: c0 00 00 00 45 31 f6 e8 d4 2a 2e 00 48 8b 83 c0 00 00 00 49 39 c5 48 8d a8 70 ff ff ff 75 1f eb 7f 0f 1f 80 00 00 00 00 66 ff 03

[21007.666826] 8b 85 90 00 00 00 49 39 c5 48 8d a8 70 ff ff ff 74 62 48 8d

[21007.690720] Stack:

[21007.690732] Call Trace:

[21007.690810] Code: ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 f5 c3 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c6 fa 66 0f

[21007.710687] Stack:

[21007.710699] Call Trace:

[21007.710760] Code: ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 f5 c3 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c6 fa 66 0f

[21007.734651] Stack:

[21007.734663] Call Trace:

[21007.735181] Code: ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 f5 c3 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c6 fa 66 0f

[21007.918147] Stack:

[21007.920726] Call Trace:

[21007.924062] Code: 0f b6 c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 90 0f b7 17 eb f5 c3 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48
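
Serial captures like this one often run several kernel entries together on one line, which makes them harder to read and compare. For anyone who wants to tidy a saved PuTTY log before reviewing it, here is a minimal Python sketch that splits raw console text on the bracketed kernel timestamps. The function name split_records and the sample string are my own illustration, not part of any Avamar or PuTTY tooling, and it assumes every entry starts with a timestamp of the form [seconds.microseconds].

```python
import re

# Kernel timestamps look like "[21007.642818]": seconds since boot,
# then six digits of microseconds (assumed format for this sketch).
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d{6})\]")

def split_records(raw):
    """Split raw console text into (seconds, message) pairs, one per timestamp."""
    matches = list(TIMESTAMP.finditer(raw))
    records = []
    for i, match in enumerate(matches):
        # A record runs from the end of its timestamp to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        message = " ".join(raw[match.end():end].split())
        records.append((float(match.group(1)), message))
    return records

# Hypothetical sample built from a few entries in the capture.
sample = "[21007.690720] Stack: [21007.690732] Call Trace:\n\n[21007.690810] Code: ba 01 00"
for seconds, message in split_records(sample):
    print(f"[{seconds:.6f}] {message}")
```

This only separates entries; it does not decode the Code: bytes or reconstruct the empty Call Trace sections, so the output is just a cleaner view of the same capture.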

 

That was 90 minutes ago; it's still going and I still cannot get in. I am waiting for the engineer who assisted earlier to become available for a Webex, but in the meantime, does anyone have any idea what is happening?
