GSoucy (1 Rookie, 35 Posts)
June 25th, 2021 07:00
Hello
I have had a predictive disk failure warning for a few days on a PowerEdge R610. I replaced the disk (73 GB, part of a RAID-1 array). I saw these messages in /var/log/messages:
Jun 23 11:26:09 daphne-d kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 2 id=0
Jun 23 11:26:09 daphne-d kernel: mptbase: ioc0: PhysDisk is now online, out of sync
Jun 23 11:26:09 daphne-d kernel: mptsas: ioc0: WARNING - firmware bug: unable to add hidden disk - target_id matchs volume_id
Jun 23 11:26:10 daphne-d kernel: mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jun 23 11:26:10 daphne-d kernel: mptbase: ioc0: volume is now degraded, enabled, resync in progress
and later:
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 2 id=10
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: PhysDisk is now online
Jun 23 12:27:11 daphne-d kernel: mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 10, phy 0, sas_addr 0x500000e118ff5022
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: volume is now optimal, enabled, resync in progress
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: volume is now optimal, enabled
Jun 23 12:27:11 daphne-d kernel: scsi 2:0:7:0: Direct-Access FUJITSU MBE2073RC D906 PQ: 0 ANSI: 5
Jun 23 12:27:11 daphne-d kernel: scsi 2:0:7:0: Attached scsi generic sg1 type 0
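The volume state changes in these kernel messages can be pulled out with a short script. This is only a sketch: the sample lines are copied from the log above, while the regular expression and the function name `volume_states` are illustrative assumptions, not part of any Dell or kernel tooling.

```python
import re

# Sample lines copied from the kernel log in this post.
LOG = """\
Jun 23 11:26:10 daphne-d kernel: mptbase: ioc0: volume is now degraded, enabled, resync in progress
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: volume is now optimal, enabled, resync in progress
Jun 23 12:27:11 daphne-d kernel: mptbase: ioc0: volume is now optimal, enabled
"""

# Assumed pattern for the "volume is now ..." lines shown above.
VOLUME_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: mptbase: \w+: "
    r"volume is now (?P<flags>.+)$"
)


def volume_states(text):
    """Return (timestamp, [flags]) for every 'volume is now ...' line."""
    states = []
    for line in text.splitlines():
        match = VOLUME_RE.match(line.strip())
        if match:
            flags = [flag.strip() for flag in match.group("flags").split(",")]
            states.append((match.group("stamp"), flags))
    return states


states = volume_states(LOG)
for stamp, flags in states:
    print(stamp, flags)

# The resync is treated as finished when the last reported state is
# "optimal" and no longer carries the "resync in progress" flag.
last_flags = states[-1][1]
resync_finished = "optimal" in last_flags and "resync in progress" not in last_flags
print("resync finished:", resync_finished)
```

On a live system the same function could be fed the contents of /var/log/messages instead of the embedded sample.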
From that I understand that the resync completed. However, I never saw the usual Rebuilding state in omreport, and the disk still reports Predictive Failure:
Controller SAS 6/iR Integrated (Embedded)
ID : 0:0:0
Status : Non-Critical
Name : Physical Disk 0:0:0
State : Online
Power Status : Not Applicable
Bus Protocol : SAS
Media : HDD
Part of Cache Pool : Not Applicable
Remaining Rated Write Endurance : Not Applicable
Failure Predicted : Yes
Revision : SM04
Driver Version : Not Applicable
Model Number : Not Applicable
T10 PI Capable : No
Certified : Not Applicable
Encryption Capable : No
Encrypted : Not Applicable
Progress : Not Applicable
Mirror Set ID : Not Applicable
Capacity : 67.75 GB (72746008576 bytes)
Used RAID Disk Space : 67.75 GB (72746008576 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare : No
Vendor ID : DELL
Product ID : ST973451SS
Serial No. : 3PD22N7N
Part Number : SG0XT7641253189L01ACA00
Negotiated Speed : 3.00 Gbps
Capable Speed : 3.00 Gbps
PCIe Maximum Link Width : Not Applicable
PCIe Negotiated Link Width : Not Applicable
Sector Size : 512B
Device Write Cache : Not Applicable
Manufacture Day : 08
Manufacture Week : 39
Manufacture Year : 2005
SAS Address : 5000C5000EC1251D
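One way to check whether this entry describes the newly inserted drive is to compare its fields with the model string the kernel printed when the new disk attached (`MBE2073RC` in the scsi line above). The sketch below assumes the `key : value` layout shown in this output; `parse_fields` is a hypothetical helper, and the excerpt is a subset of the fields copied verbatim.

```python
# Excerpt of the omreport output above (fields copied verbatim).
OMREPORT = """\
ID : 0:0:0
Status : Non-Critical
State : Online
Failure Predicted : Yes
Vendor ID : DELL
Product ID : ST973451SS
Serial No. : 3PD22N7N
Manufacture Year : 2005
"""

# Model string taken from the kernel "scsi 2:0:7:0: Direct-Access" line.
NEW_DISK_MODEL = "MBE2073RC"


def parse_fields(text):
    """Turn 'key : value' lines into a dict, skipping lines without ' : '."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


fields = parse_fields(OMREPORT)
flagged = fields.get("Failure Predicted") == "Yes"
is_new_disk = fields.get("Product ID") == NEW_DISK_MODEL

print("disk", fields["ID"], "failure predicted:", flagged)
print("matches newly attached model:", is_new_disk)
```

With the values in this post, the flagged entry carries a different Product ID than the drive the kernel attached. That could mean the report is stale, or that the flagged slot is not the one that was swapped; the script cannot tell which.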
Also, I now get these messages in /var/log/messages every night:
Jun 21 03:23:57 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 22 03:34:28 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 23 03:45:02 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 24 03:55:40 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 25 04:06:10 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
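To see which of these alerts were raised after the replacement, the lines can be filtered by date. This is a rough sketch: syslog lines carry no year, so 2021 is assumed from the date of this post, and the replacement date is the one stated below (June 23rd). The names `parse_alert` and `REPLACED_ON` are illustrative.

```python
import re
from datetime import date, datetime

# Alert lines copied from the log above.
ALERTS = """\
Jun 21 03:23:57 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 22 03:34:28 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 23 03:45:02 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 24 03:55:40 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
Jun 25 04:06:10 daphne-d Server_Administrator: 3641 2094 - Storage Service Predictive Failure reported: Physical Disk 0:0:0 Controller 0, Connector 0
"""

ASSUMED_YEAR = 2021                 # assumption: syslog omits the year
REPLACED_ON = date(2021, 6, 23)     # replacement date stated in the post

DISK_RE = re.compile(r"Predictive Failure reported: Physical Disk (\S+)")


def parse_alert(line):
    """Return (timestamp, disk_id) for an alert line, or None if it does not match."""
    match = DISK_RE.search(line)
    if not match:
        return None
    month, day, clock = line.split()[:3]
    stamp = datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    return stamp, match.group(1)


alerts = [a for a in map(parse_alert, ALERTS.splitlines()) if a]
after_swap = [(stamp, disk) for stamp, disk in alerts if stamp.date() > REPLACED_ON]
disks = {disk for _, disk in alerts}

print("alerts total:", len(alerts))
print("alerts after replacement day:", len(after_swap))
print("disks named:", sorted(disks))
```

The June 23rd alert is excluded from `after_swap` because it was logged at 03:45, before the first kernel message about the new disk at 11:26.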
I replaced the disk on June 23rd.
I am not sure that the new disk has really been accepted by the system. Why do I still have a predictive disk failure? Would a reboot help? Any suggestions on how to get rid of the predictive failure?
Thank you in advance
Gilbert
Dell-DylanJ (4 Operator, 2.9K Posts)
June 25th, 2021 09:00
Hello,
A reboot may help; I've seen reporting issues like this resolved by a power cycle. This controller doesn't maintain a hardware-level RAID log, though. If a reboot doesn't correct the reporting, troubleshooting options will be quite limited.
Based on the log output you have, it does look like the system thinks the drive is still bad. Restarting would be a good idea. You might also consider reseating storage cabling in that same window.