Unsolved
Dell PowerEdge 2900 - PERC 6/i (LSI Logic SAS1078)RAID Help
I apologize in advance for the long post, but I wanted to give as much information as possible. We recently had a server self-destruct (it required a hard reboot) due to a problem with what appears to be either the backplane itself or one of the physical drives. We cannot reliably determine the cause of the failure. After gathering all the information we could, I searched around but could not find anything relevant on the net or in these forums, so I'm posting here in the hope of gaining some insight into what happened and what needs to be done to resolve the problem. My questions are: Is this just a simple single-drive failure in slot 0? If so, why would it kill the entire server? What can be done to prevent a single drive failure from taking down the entire box? Is there something more going on, such as the SAS backplane needing replacement? Thanks for any help or suggestions.
OS: FreeBSD 7.0-RELEASE
0:0 RAID-1 (2 drives) 136GB
0:1 RAID-10 (8 drives) 544GB
Total Drives: 10

pciconf Output:
mfi0@pci0:1:0:0: class=0x010400 card=0x1f0c1028 chip=0x00601000 rev=0x04 hdr=0x00
    vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
    device = 'SAS1078 PCI-X Fusion-MPT SAS'
    class = mass storage
    subclass = RAID
    cap 10[b0] = PCI-Express 1 endpoint
    cap 05[c4] = MSI supports 4 messages, 64 bit
    cap 11[d4] = MSI-X supports 4 messages in map 0x10
    cap 01[e0] = powerspec 2 supports D0 D1 D3 current D0
    cap 03[ec] = VPD

MegaCli -AdpAllInfo -aall
Adapter #0
==============================================================================
Versions
================
Product Name : PERC 6/i Integrated
Serial No : 1122334455667788
FW Package Build: 6.0.2-0002

Mfg. Data
================
Mfg. Date : 06/08/07
Rework Date : 06/08/07
Revision No :
Battery FRU : N/A

Image Versions In Flash:
================
FW Version : 1.11.52-0396
BIOS Version : NT13-2
WebBIOS Version : 1.1-32-e_11-Rel
Ctrl-R Version : 1.01-010B
Boot Block Version : 1.00.00.01-0008

Pending Images In Flash
================
None

PCI Info
================
Vendor Id : 1000
Device Id : 0060
SubVendorId : 1028
SubDeviceId : 1f0c
Host Interface : PCIE
Number of Frontend Port: 0
Device Interface : PCIE
Number of Backend Port: 8
Port : Address
0 5000c50008e944d5
1 5000c50008e92f71
2 5000c50008e937b9
3 5000c50008e93e89
4 5001e4f33fb4ca0f
5 0000000000000000
6 0000000000000000
7 0000000000000000

HW Configuration
================
SAS Address : 5001e4f0365a0d00
BBU : Present
Alarm : Absent
NVRAM : Present
Serial Debugger : Present
Memory : Present
Flash : Present
Memory Size : 256MB

Settings
================
Current Time : 15:7:12 5/25, 2010
Predictive Fail Poll Interval : 300sec
Interrupt Throttle Active Count : 16
Interrupt Throttle Completion : 50us
Rebuild Rate : 30%
PR Rate : 30%
Resynch Rate : 30%
Check Consistency Rate : 30%
Reconstruction Rate : 30%
Cache Flush Interval : 4s
Max Drives to Spinup at One Time : 2
Delay Among Spinup Groups : 12s
Physical Drive Coercion Mode : 128MB
Cluster Mode : Disabled
Alarm : Disabled
Auto Rebuild : Enabled
Battery Warning : Enabled
Ecc Bucket Size : 15
Ecc Bucket Leak Rate : 1440 Minutes
Restore HotSpare on Insertion : Disabled
Expose Enclosure Devices : Disabled
Maintain PD Fail History : Disabled
Host Request Reordering : Enabled
Auto Detect BackPlane Enabled : SGPIO/i2c SEP
Load Balance Mode : Auto
Any Offline VD Cache Preserved : No

Capabilities
================
RAID Level Supported : RAID0, RAID1, RAID5, RAID6, RAID10, RAID50, RAID60
Supported Drives : SAS, SATA
Allowed Mixing:
Mix In Enclosure Allowed

Status
================
ECC Bucket Count : 0

Limitations
================
Max Arms Per VD : 32
Max Spans Per VD : 8
Max Arrays : 128
Max Number of VDs : 64
Max Parallel Commands : 1008
Max SGE Count : 80
Max Data Transfer Size : 8192 sectors
Max Strips PerIO : 42
Min Stripe Size : 8kB
Max Stripe Size : 1024kB

Device Present
================
Virtual Drives : 2
    Degraded : 0
    Offline : 0
Physical Devices : 11
    Disks : 10
    Critical Disks : 0
    Failed Disks : 0

Supported Adapter Operations
================
Rebuild Rate : Yes
CC Rate : Yes
BGI Rate : Yes
Reconstruct Rate : Yes
Patrol Read Rate : Yes
Alarm Control : Yes
Cluster Support : No
BBU : Yes
Spanning : Yes
Dedicated Hot Spare : Yes
Revertible Hot Spares : No
Foreign Config Import : Yes
Self Diagnostic : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares : Yes
Deny SCSI Passthrough : No
Deny SMP Passthrough : No
Deny STP Passthrough : No

Supported VD Operations
================
Read Policy : Yes
Write Policy : Yes
IO Policy : Yes
Access Policy : Yes
Disk Cache Policy : Yes
Reconstruction : Yes
Deny Locate : No
Deny CC : No

Supported PD Operations
================
Force Online : Yes
Force Offline : Yes
Force Rebuild : Yes
Deny Force Failed : No
Deny Force Good/Bad : No
Deny Missing Replace : No
Deny Clear : No
Deny Locate : No
Disable Copyback : No
Enable Copyback on SMART : No
Enable Copyback to SSD on SMART error : No

Error Counters
================
Memory Correctable Errors : 0
Memory Uncorrectable Errors : 0

Cluster Information
================
Cluster Permitted : No
Cluster Active : No

Default Settings
================
Phy Polarity : 0
Phy PolaritySplit : 0
Background Rate : 30
Stripe Size : 64kB
Flush Time : 4 seconds
Write Policy : WB
Read Policy : None
Cache When BBU Bad : Disabled
Cached IO : No
SMART Mode : Mode 6
Alarm Disable : No
Coercion Mode : 128MB
ZCR Config : Unknown
Dirty LED Shows Drive Activity : No
BIOS Continue on Error : No
Spin Down Mode : None
Allowed Device Type : SAS/SATA Mix
Allow Mix In Enclosure : Yes
Allow HDD SAS/SATA Mix In VD : No
Allow SSD SAS/SATA Mix In VD : No
Allow HDD/SAS Mix In VD : No
Allow SATA In Cluster : No
Max Chained Enclosures : 1
Disable Ctrl-R : No
Enable Web BIOS : No
Direct PD Mapping : Yes
BIOS Enumerate VDs : Yes
Restore Hot Spare on Insertion : No
Expose Enclosure Devices : No
Maintain PD Fail History : No
Disable Puncturing : No
Zero Based Enclosure Enumeration : Yes
PreBoot CLI Enabled : No
LED Show Drive Activity : No
Cluster Disable : Yes
SAS Disable : No
Auto Detect BackPlane Enable : SGPIO/i2c SEP
Delay during POST : 0
[....]
On the console, there were what seemed like hundreds of command timeout messages: "mfi0: COMMAND 0xffffff(xxxxxxx) TIMEOUT AFTER XXX SECONDS" (sorry, I don't have the exact command and number of seconds). After we rebooted the server and looked at the logs, we saw a lot of these messages:

[...]
May 25 13:20:42 server kernel: mfi0: 11410 (312692400s/0x0020/0) - Patrol Read started
May 25 13:20:42 server kernel: mfi0: 11445 (312693679s/0x0002/0) - PD 00(e32/s0) CDB 2f:00:02:a1:90:00:00:10:00:00Sense f0:00:03:02:a1:99:d4:0a:00:00:00:00:11:00:81:80:00:97
May 25 13:20:42 server kernel: : Unexpected sense: PD 00(e0x20/s0), CDB: 2f 00 02 a1 90 00 00 10 00 00, Sense: f0 00 03 02 a1 99 d4 0a 00 00 00 00 11 00 81 80 0
May 25 13:20:42 server kernel: mfi0: 11446 (312693681s/0x0002/0) - PD 00(e32/s0) CDB 2e:00:02:a1:99:d4:00:00:01:00Sense f0:00:03:02:a1:99:d4:0a:00:00:00:00:11:00:81:80:00:97
May 25 13:20:42 server kernel: : Unexpected sense: PD 00(e0x20/s0), CDB: 2e 00 02 a1 99 d4 00 00 01 00, Sense: f0 00 03 02 a1 99 d4 0a 00 00 00 00 11 00 81 80 0
May 25 13:20:42 server kernel: mfi0: 11447 (312693681s/0x0002/3) - PD 00(e32/s0) lba 44145108: Patrol Read found an uncorrectable medium error on PD 00(e0x20/s0) at 2a199d4
May 25 13:20:42 server kernel: mfi0: 11448 (312693682s/0x0002/0) - PD 00(e32/s0) CDB 2f:00:02:a1:99:d5:00:10:00:00Sense f0:00:03:02:a1:9d:bc:0a:00:00:00:00:11:00:81:80:00:97
May 25 13:20:42 server kernel: : Unexpected sense: PD 00(e0x20/s0), CDB: 2f 00 02 a1 99 d5 00 10 00 00, Sense: f0 00 03 02 a1 9d bc 0a 00 00 00 00 11 00 81 80 0
May 25 13:20:42 server kernel: mfi0: 11449 (312693686s/0x0002/0) - PD 00(e32/s0) CDB 2e:00:02:a1:9d:bc:00:00:01:00Sense f0:00:03:02:a1:9d:bc:0a:00:00:00:00:11:00:81:80:00:97
May 25 13:20:42 server kernel: : Unexpected sense: PD 00(e0x20/s0), CDB: 2e 00 02 a1 9d bc 00 00 01 00, Sense: f0 00 03 02 a1 9d bc 0a 00 00 00 00 11 00 81 80 0
May 25 13:20:42 server kernel: mfi0: 11450 (312693686s/0x0002/3) - PD 00(e32/s0) lba 44146108: Patrol Read found an uncorrectable medium error on PD 00(e0x20/s0) at 2a19dbc
May 25 13:20:42 server kernel: mfi0: 11527 (312705028s/0x0020/0) - Patrol Read complete
[...]
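For anyone sifting through a larger syslog, the Patrol Read entries above can be summarized per drive with a short script. This is only a sketch: the regular expression is an assumption based on the message format shown in this post, and the function name `medium_errors` is hypothetical. It groups the failing LBAs by enclosure and slot, converting the hex address after "at" to decimal so it can be compared with the "lba" field.

```python
import re
from collections import defaultdict

# Pattern assumed from the log lines quoted in this thread; other firmware
# or driver versions may word the message differently.
MEDIUM_ERROR = re.compile(
    r"Patrol Read found an uncorrectable medium error on "
    r"PD (?P<pd>[0-9a-f]+)\(e(?P<encl>0x[0-9a-f]+)/s(?P<slot>\d+)\) "
    r"at (?P<lba>[0-9a-f]+)"
)


def medium_errors(lines):
    """Return {(enclosure, slot): sorted list of failing LBAs (decimal)}."""
    found = defaultdict(set)
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            key = (int(match.group("encl"), 16), int(match.group("slot")))
            found[key].add(int(match.group("lba"), 16))
    return {key: sorted(lbas) for key, lbas in found.items()}


# Two of the lines from the log above, plus one unrelated line.
sample = [
    "May 25 13:20:42 server kernel: mfi0: 11410 (312692400s/0x0020/0) - Patrol Read started",
    "May 25 13:20:42 server kernel: mfi0: 11447 (312693681s/0x0002/3) - PD 00(e32/s0) lba 44145108: "
    "Patrol Read found an uncorrectable medium error on PD 00(e0x20/s0) at 2a199d4",
    "May 25 13:20:42 server kernel: mfi0: 11450 (312693686s/0x0002/3) - PD 00(e32/s0) lba 44146108: "
    "Patrol Read found an uncorrectable medium error on PD 00(e0x20/s0) at 2a19dbc",
]

for (enclosure, slot), lbas in medium_errors(sample).items():
    print(f"enclosure {enclosure} slot {slot}: {len(lbas)} bad LBA(s): {lbas}")
```

If every reported LBA maps to the same enclosure and slot, as it does here, that points at one physical drive rather than at the backplane as a whole, though it does not rule out a cabling or backplane fault on that slot.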
Digging through the User Manual for "uncorrectable medium error" shows that the message is classed as "Fatal" (a component has failed and data loss has occurred or will occur). Decoding the "Unexpected sense" messages further, using http://www.t10.org/lists/asc-num.txt and Intel_raid_decode_events.pdf, gives 11h/00h = "UNRECOVERED READ ERROR", which (if I'm guessing right) suggests that the drive in slot 0 needs to be replaced. However, looking at the drive using MegaCli, it is not showing any error counts:
MegaCli -PDInfo -PhysDrv [32:0] -aALL
Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 140014MB [0x11177328 Sectors]
Non Coerced Size: 139502MB [0x11077328 Sectors]
Coerced Size: 139392MB [0x11040000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c50008e944d5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3146855SS S5273LN4XNPX
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

Checking AdpEventLog does however show the same errors as in the syslogs

MegaCli -AdpEventLog -GetEvents -f events.log -aALL && cat events.log
[....]
seqNum: 0x000038cf
Time: Sat May 22 17:03:37 2010
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 00(e0x20/s0), CDB: 28 00 02 a1 99 c0 00 00 20 00, Sense: f0 00 03 02 a1 99 d4 0a 00 00 00 00 11 00 81 80 0
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
CDB Length: 10
CDB Data: 0028 0000 0002 00a1 0099 00c0 0000 0000 0020 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data: 00f0 0000 0003 0002 00a1 0099 00d4 000a 0000 0000 0000 0000 0011 0000 0081 0080 0000 0097 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x000038d0
Time: Sat May 22 18:05:32 2010
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 00(e0x20/s0), CDB: 28 00 02 a1 9d a0 00 00 20 00, Sense: f0 00 03 02 a1 9d bc 0a 00 00 00 00 11 00 81 80 0
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
CDB Length: 10
CDB Data: 0028 0000 0002 00a1 009d 00a0 0000 0000 0020 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data: 00f0 0000 0003 0002 00a1 009d 00bc 000a 0000 0000 0000 0000 0011 0000 0081 0080 0000 0097 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[.....]
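The manual decoding described above (sense bytes to 11h/00h) can also be done mechanically. The following is a minimal sketch, assuming the sense buffer is SCSI fixed-format sense data (response code 70h/71h) and the CDB is a 10-byte command; the lookup tables are deliberately tiny subsets of the T10 lists, and the helper names (`from_hex`, `decode_fixed_sense`, `decode_cdb10`) are hypothetical.

```python
# Small subsets of the T10 tables, enough for the events in this thread.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}
OPCODES = {0x28: "READ(10)", 0x2E: "WRITE AND VERIFY(10)", 0x2F: "VERIFY(10)"}


def from_hex(text):
    """Accept 'f0 00 03' or 'f0:00:03' style dumps."""
    return bytes.fromhex(text.replace(":", " "))


def decode_fixed_sense(sense):
    """Decode the fields of fixed-format sense data used in this thread."""
    if sense[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    asc, ascq = sense[12], sense[13]
    return {
        "info_valid": bool(sense[0] & 0x80),   # INFORMATION field is meaningful
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "information": int.from_bytes(sense[3:7], "big"),  # failing LBA here
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:02x}h/{ascq:02x}h"),
    }


def decode_cdb10(cdb):
    """Decode opcode, starting LBA and block count of a 10-byte CDB."""
    return {
        "opcode": OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# Sense bytes from the syslog (all 18 bytes) and the CDB from event 0x38cf.
sense = decode_fixed_sense(
    from_hex("f0:00:03:02:a1:99:d4:0a:00:00:00:00:11:00:81:80:00:97")
)
cdb = decode_cdb10(from_hex("28 00 02 a1 99 c0 00 00 20 00"))
print(sense)
print(cdb)
# True when the reported LBA lies inside the range the command touched.
print(cdb["lba"] <= sense["information"] < cdb["lba"] + cdb["blocks"])
```

For the first event this gives sense key MEDIUM ERROR with ASC/ASCQ 11h/00h, and the reported LBA 44145108 (0x2a199d4) falls inside the 32-block READ(10) that started at 0x2a199c0, which matches the LBA in the later Patrol Read message.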
theflash1932
7 Technologist
16.3K Posts
0
May 25th, 2010 15:00
It would appear that it is just the drive. A single drive failure does not normally crash the system, but depending on the nature of the failure it can, especially if it is a power issue or the drive is sending unrecognizable data to the controller. I would not risk keeping that drive in the system for troubleshooting; I would simply replace it. After replacing the drive, I would recommend updating the system firmware (BIOS, ESM, RAID controller). The RAID firmware can contain code that helps the controller deal better with errors, possibly even errors of this type.
bnyec
5 Posts
0
May 25th, 2010 16:00
theflash1932
7 Technologist
16.3K Posts
0
May 25th, 2010 20:00
Always use drivers from Dell.com. If they aren't available on Dell.com (say you are installing an unsupported OS), then go to the device manufacturer. Even though Dell-branded hardware usually has a brand-name equivalent, there are often small tweaks or differences that guarantee its performance inside Dell machines. You can get all the drivers at the link below.
As you are not running a supported OS, I don't know if the Linux update packages will run from within your OS. You can update outside of the OS by downloading SBUU (System Build and Update Utility) and SUU (Server Update Utility); for SUU, download all three images and merge them using the commands in the instructions.
Drivers and Downloads Page:
http://support.dell.com/support/downloads/driverslist.aspx?os=WNET&catid=-1&dateid=-1&impid=-1&osl=EN&typeid=-1&formatid=-1&servicetag=&SystemID=PWE_2900&hidos=WS8R2&hidlang=en&TabIndex=&scanSupported=True&scanConsent=False
SBUU:
SUU:
<ADMIN NOTE: Broken link has been removed / replaced from this post by Dell>
bnyec
5 Posts
0
May 26th, 2010 11:00
bnyec
5 Posts
0
June 2nd, 2010 13:00