September 10th, 2012 15:00
Unexpected storage processor reboot
Storage processor B (SPB) on my Dell-branded CX4-120 unexpectedly rebooted over the weekend. Everything recovered and the array currently has no outstanding alerts, but Dell wants to replace SPB because of Event Code:0x7127ca2a.
That event code, logged by SPA, recommends replacing SPB, but other events indicate that SPB crashed with a bug check. Isn't that just a fancy name for the Windows Blue Screen of Death? I'm wondering whether the event saying SPB is faulted and needs to be replaced is a false alert, raised only because SPB froze at the point of the bug check and was completely unresponsive.
Any opinions on this?
From SPA
==============
Date:2012-09-08
Time:15:52:42
Event Code:0x71100001
Description:Lost contact with fc1e60c460010650:1 on conduit 3. 00 00 04 00 04 00 4c 00 d3 04 00 00 01 00 10 61 01 00 10 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 71 10 00 01
Subsystem:APM00100102072
Device:N/A
SP:N/A
Host:bznsnarp02_SPA
Source:mps
Category:NT System Log
Log:NT System Log
Sense Key:N/A
Ext Code1:N/A
Ext Code2:N/A
Type:Information
--- many --- intervening --- events --- in --- between
Date:2012-09-08
Time:15:55:38
Event Code:0x7127ca2a
Description:SPB is faulted. Fault Code: 0. FRU: CPU Module - Part Number: 303-093-001B should be replaced. Please call your service provider. 00000400 05005600 d3040000 2aca27e1 2aca27e1 00000000 00000000 00000000 00000000 00000000 7127ca2a
Subsystem:APM00100102072
Device:N/A
SP:N/A
Host:bznsnarp02_SPA
Source:Flaredrv
Category:NT System Log
Log:NT System Log
Sense Key:N/A
Ext Code1:N/A
Ext Code2:N/A
Type:Critical Error
From SPB
==============
Date:2012-09-08
Time:16:07:15
Event Code:0x2183
Description:The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009c (0x0000000000000001, 0xfffffadfbee91280, 0x00000000b2000000, 0x0000000000000175). A dump was saved in: C:\dumps\crash.dmp.
Subsystem:APM00100102072
Device:N/A
SP:N/A
Host:bznsnarp02_SPB
Source:Save Dump
Category:NT System Log
Log:NT System Log
Sense Key:N/A
Ext Code1:N/A
Ext Code2:N/A
Type:Error
Date:2012-09-08
Time:16:14:45
Event Code:0x76008106
Description:The Storage Processor rebooted unexpectedly @ 21:52:40 on 09/08/2012: BugCheck 9C, {0000000000000001, fffffadfbee91280, 00000000b2000000, 0000000000000175}, Failing Instruction: 0xfffff8000080ef6c in hal.dll loaded @ 0xfffff80000800000 76 00 81 06
Subsystem:APM00100102072
Device:N/A
SP:N/A
Host:bznsnarp02_SPB
Source:K10_DGSSP
Category:NT Application Log
Log:NT Application Log
Sense Key:N/A
Ext Code1:N/A
Ext Code2:N/A
Type:Error
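When comparing events from both SPs, it can help to load the pasted `Key:Value` records into a structure and filter by severity. The following is a minimal sketch, assuming each event starts with a `Date:` line as in the excerpts above. The function names `parse_events` and `severe` are hypothetical, and the sample is a shortened copy of two events from this post.

```python
# Shortened copy of two events from the post (some fields omitted).
SAMPLE = """\
From SPA
==============
Date:2012-09-08
Time:15:55:38
Event Code:0x7127ca2a
Description:SPB is faulted. Fault Code: 0. FRU: CPU Module - Part Number: 303-093-001B should be replaced.
Host:bznsnarp02_SPA
Source:Flaredrv
Type:Critical Error
From SPB
==============
Date:2012-09-08
Time:16:07:15
Event Code:0x2183
Description:The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009c (0x0000000000000001, 0xfffffadfbee91280, 0x00000000b2000000, 0x0000000000000175).
Host:bznsnarp02_SPB
Source:Save Dump
Type:Error
"""


def parse_events(text):
    """Return one dict per event. Assumption: a 'Date:' line opens each event."""
    events = []
    for line in text.splitlines():
        # Split on the first colon only, so values such as 15:55:38 stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # separator or heading line without a colon
        if key == "Date":
            events.append({})
        if events:
            events[-1][key] = value
    return events


def severe(events):
    """Keep only events whose Type is Error or Critical Error."""
    return [e for e in events if e.get("Type") in ("Error", "Critical Error")]


if __name__ == "__main__":
    for event in severe(parse_events(SAMPLE)):
        print(event["Host"], event["Time"], event["Event Code"], event["Type"])
```

Sorting the combined list by `Date` and `Time` then makes it easier to see that SPA logged the fault before SPB recorded its own reboot.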


Anirudh_Banerje
59 Posts
1
September 11th, 2012 03:00
Hi
Yes, you are correct: in this case the bugcheck is just another name for a Windows BSOD.
However, the alert raised when SPB rebooted is not a false alert. A bugcheck (BSOD) can be triggered by a hardware interrupt or by a driver or application problem. In this case the SPB CPU module may have raised the hardware interrupt that produced the bugcheck and rebooted SPB, and the event message states that the CPU module should be replaced. Replacing SPB is recommended.
If you need further investigation, get Dell involved to analyse the dump file generated by the bugcheck. That can give you the exact root cause.
Thanks
Anirudh
DELL-Sheron G
Moderator
•
228 Posts
1
September 11th, 2012 07:00
Greetings,
Emc161432 covers the 0x7127ca2a alert.
That is an old Primus article; however, I reckon there is an issue with the SP's CPU module.
You might need to replace the part.
Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit (CPU) detects a hardware problem. The error usually occurs due to failure or overstressing of hardware components where the error cannot be more specifically identified with a different error message.
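The bugcheck code in the SPB event, 0x0000009c, is the Windows bug check named MACHINE_CHECK_EXCEPTION, which is why the reboot points towards hardware. A minimal sketch of pulling the code and its four parameters out of the event description follows. The function name `decode_bugcheck` and the one-entry lookup table are hypothetical, and the sketch does not attempt to interpret the parameters, whose meaning depends on the processor.

```python
import re

# Lookup of bug check names; only the code seen in this thread is listed.
BUGCHECK_NAMES = {0x9C: "MACHINE_CHECK_EXCEPTION"}

# Matches text such as "The bugcheck was: 0x0000009c (p1, p2, p3, p4)".
PATTERN = re.compile(r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")

# Description text copied from the SPB event 0x2183 in this thread.
DESCRIPTION = (
    "The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009c "
    "(0x0000000000000001, 0xfffffadfbee91280, 0x00000000b2000000, "
    "0x0000000000000175). A dump was saved in: C:\\dumps\\crash.dmp."
)


def decode_bugcheck(description):
    """Return the bug check code, its name if known, and the raw parameters."""
    match = PATTERN.search(description)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return {
        "code": code,
        "name": BUGCHECK_NAMES.get(code, "UNKNOWN"),
        "params": params,
    }


if __name__ == "__main__":
    info = decode_bugcheck(DESCRIPTION)
    print(hex(info["code"]), info["name"], [hex(p) for p in info["params"]])
```

The parameters are kept as raw integers so they can be passed on to support together with the dump file.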
I would recommend raising an EMC ticket. You will be asked to provide SP Collects and the dump file located at C:\dumps\crash.dmp on the rebooted SP.
Bugchecks can sometimes be hazardous and may cause memory leaks.
Regards,
Sheron Godfred
AnkitMehta
4 Apprentice
•
1.4K Posts
0
September 11th, 2012 08:00
Hi Poore,
I believe this CX4-120 is not running FLARE R30.524 and is most probably on R28 or R29.
Before you go for any hardware replacement, I would request that you check the points below:
- Get a health check done by Dell Support. (Agreed, there was an SP reboot due to a bugcheck.)
- Check whether the fault LED on the storage system is amber, if you have physical access to the storage.
- Verify from the Navisphere/Unisphere Fault Status Report whether the CPU module, or anything else, is faulted.
- Remember, if you got error 7127ca2a because of a bugcheck, there is a high chance that the CPU module will come back online. (Do NOT replace it unless the fault is verified by SP Collects analysis.)
- Lastly, upgrade FLARE to R30.524 once you are sure the storage system is normal.
PS: We like the SP to bring itself back online, but if the hardware is marginal the problem could return, so L2 or Engineering should review the case for a 'proactive' replacement recommendation. This FRU should not be replaced without verifying the exact fault by analyzing the SP Collects.
VincePoore
12 Posts
0
September 11th, 2012 08:00
Thank you for the feedback. I will proceed with having SPB replaced. Is this something that can be done while the array remains up and running on SPA?
Anirudh_Banerje
59 Posts
0
September 11th, 2012 08:00
Hi
Yes, you can replace SPB without downtime as long as SPA is alive. Even so, scheduling a maintenance window before replacing an SP is always recommended as a best practice. Like Sheron said, also plan a FLARE upgrade to the latest version.
Thanks
Anirudh
DELL-Sheron G
Moderator
•
228 Posts
0
September 11th, 2012 08:00
Yes, it can be done online if failover is configured properly.
Replacement may not be the actual resolution, though.
We would also need to upgrade FLARE, since the issue may have been caused by a bug in FLARE.
Regards.
Sheron Godfred