Spontaneous remove/insert events on 2950 with perc 5/i

Lately, I have been seeing PDs (physical disks) go offline and come back online spontaneously about 10 seconds later. At least two of the four SAS disks in my RAID 10 array have done this. Each event either triggers a rebuild or, if a second disk goes offline before the rebuild completes, takes the whole array offline. When that happens, I find the array with foreign configurations that I cannot import. However, if I clear the configuration and recreate the array without initializing it, everything runs perfectly again until the next spontaneous offline event.

What could this be? Bad connections? Flaky disks? The RAID controller? This has happened a couple of times now and it's driving me nuts.

Here is an excerpt from the controller event log showing one such event:

seqNum: 0x00003132 Tue Apr 19 23:10:17 2011 Code: 0x00000070
Removed: PD 02(e1/s2)
seqNum: 0x00003133 Tue Apr 19 23:10:17 2011 Code: 0x000000f8
Removed: PD 02(e1/s2) Info: enclPd=08, scsiType=0, portMap=04, sasAddr=50010b9000063e0e,0000000000000000
seqNum: 0x00003134 Tue Apr 19 23:10:17 2011 Code: 0x00000051
State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
seqNum: 0x00003135 Tue Apr 19 23:10:17 2011 Code: 0x000000fb
VD 00/0 is now DEGRADED
seqNum: 0x00003136 Tue Apr 19 23:10:17 2011 Code: 0x00000072
State change on PD 02(e1/s2) from ONLINE(18) to FAILED(11)
seqNum: 0x00003137 Tue Apr 19 23:10:17 2011 Code: 0x00000072
State change on PD 02(e1/s2) from FAILED(11) to UNCONFIGURED_BAD(1)
seqNum: 0x00003138 Tue Apr 19 23:10:28 2011 Code: 0x0000005b
Inserted: PD 02(e1/s2)
seqNum: 0x00003139 Tue Apr 19 23:10:28 2011 Code: 0x000000f7
Inserted: PD 02(e1/s2) Info: enclPd=08, scsiType=0, portMap=04, sasAddr=50010b9000063e0e,0000000000000000
seqNum: 0x0000313a Tue Apr 19 23:10:28 2011 Code: 0x000000ec
PD 02(e1/s2) is not a certified drive
seqNum: 0x0000313b Tue Apr 19 23:10:28 2011 Code: 0x00000072
State change on PD 02(e1/s2) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
seqNum: 0x0000313c Tue Apr 19 23:10:31 2011 Code: 0x00000072
State change on PD 02(e1/s2) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
seqNum: 0x0000313d Tue Apr 19 23:10:31 2011 Code: 0x0000006a
Rebuild automatically started on PD 02(e1/s2)
seqNum: 0x0000313e Tue Apr 19 23:10:31 2011 Code: 0x00000072
State change on PD 02(e1/s2) from OFFLINE(10) to REBUILD(14)
seqNum: 0x000031a3 Tue Apr 19 23:52:53 2011 Code: 0x00000064
Rebuild complete on PD 02(e1/s2)
seqNum: 0x000031a4 Tue Apr 19 23:52:53 2011 Code: 0x00000051
State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
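To check how long a disk stays gone in each event, the log can be paired up mechanically. The following is only a rough sketch, assuming the log keeps the two-line layout shown above (a `seqNum` header line followed by a description line). The function names `parse_events` and `remove_insert_gaps` are made up for illustration and are not part of any Dell or LSI tool; `SAMPLE` is a few lines copied from the excerpt above.

```python
import re
from datetime import datetime

# A few lines copied from the controller log excerpt in this post.
SAMPLE = """\
seqNum: 0x00003132 Tue Apr 19 23:10:17 2011 Code: 0x00000070
Removed: PD 02(e1/s2)
seqNum: 0x00003133 Tue Apr 19 23:10:17 2011 Code: 0x000000f8
Removed: PD 02(e1/s2) Info: enclPd=08, scsiType=0, portMap=04, sasAddr=50010b9000063e0e,0000000000000000
seqNum: 0x00003138 Tue Apr 19 23:10:28 2011 Code: 0x0000005b
Inserted: PD 02(e1/s2)
seqNum: 0x00003139 Tue Apr 19 23:10:28 2011 Code: 0x000000f7
Inserted: PD 02(e1/s2) Info: enclPd=08, scsiType=0, portMap=04, sasAddr=50010b9000063e0e,0000000000000000
"""

# Header line: sequence number, timestamp, event code.
HEADER = re.compile(
    r"seqNum: (0x[0-9a-f]+) (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) Code: (0x[0-9a-f]+)"
)
# Description line for a physical disk removal or insertion.
PD_EVENT = re.compile(r"^(Removed|Inserted): PD (\d+)\(")


def parse_events(text):
    """Return a list of (timestamp, description) tuples, one per log entry."""
    events = []
    stamp = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            stamp = datetime.strptime(header.group(2), "%a %b %d %H:%M:%S %Y")
            continue
        if stamp is not None and line:
            events.append((stamp, line))
            stamp = None
    return events


def remove_insert_gaps(events):
    """Pair the first 'Removed' of each PD with its next 'Inserted'.

    Returns a list of (pd_number, seconds_absent) tuples.
    """
    pending = {}  # PD number -> time of the first unmatched removal
    gaps = []
    for stamp, desc in events:
        match = PD_EVENT.match(desc)
        if not match:
            continue
        kind, pd = match.group(1), match.group(2)
        if kind == "Removed":
            pending.setdefault(pd, stamp)  # ignore the duplicate "Info" line
        elif pd in pending:
            gaps.append((pd, (stamp - pending.pop(pd)).total_seconds()))
    return gaps


if __name__ == "__main__":
    print(remove_insert_gaps(parse_events(SAMPLE)))
```

On the sample lines this reports PD 02 as absent for 11 seconds, which matches the "about 10 seconds" I see each time. Run over a full log export, it would show whether the gap is the same for every disk and every event.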

Responses(4)

Anonymous

5 Practitioner

274.2K Posts

1

July 8th, 2011 06:00

A number of different issues could produce the controller log you provided. The good news is that the drive does rebuild back to an optimal state, which rules out a host of other scenarios.

With this in mind, I would recommend starting with a consistency check on your RAID 10 to ensure the mirrored data matches across all the members and the controller. Second, I would apply any pending updates to the system, including the controller, the BMC (which includes the backplane firmware), and the drives themselves if firmware is available for them, so that all aspects are covered in an attempt to resolve this specific issue.

eavdmeer

3 Posts

0

July 8th, 2011 07:00

I have actually done a consistency check. It finished without problems only to have the whole array go offline again after a day or two due to two consecutive remove events. Yesterday, I had one of the two other disks go offline and online again. So either 3 out of 4 disks have gone bad after a little over a year or something else is flaky.

I have already updated the system firmware and the RAID controller firmware, without any effect. I am a bit puzzled about the BMC/backplane firmware. I use Fedora 15, and all the firmware tools for updating the BMC are for Red Hat Enterprise Linux. Can I use any of those on Fedora?

eavdmeer

3 Posts

0

July 8th, 2011 07:00

Thanks for the suggestion. I'll give that a try next week.

I am still puzzled about what the remove events signify. Do they occur only when a disk is physically removed (or a cable is disconnected, etc.), or could a drive failure cause the same event?

Anonymous

5 Practitioner

274.2K Posts

0

July 8th, 2011 07:00

With Fedora, the options are rather limited. To do the updates, I would advise booting from our OMSA Live disk and then running the RHEL packages. You can download all the updates to a USB key, boot from the ISO, and then run the .bin files from the USB key to bring the system current.

linux.dell.com/.../OMSA64-CentOS55-x86_64-LiveCD.iso
