This article provides troubleshooting steps for bad blocks (punctures) on hard disk drives in PowerEdge servers with PERC controllers. Especially when no backup is possible, the following information may help to bring an affected Virtual Disk back to an optimal state.
The OpenManage Server Administrator (OMSA) shows a red cross in front of a Virtual Disk (Figure 1).
Figure 1: Virtual Disk with red cross in Status (example H800)
The Windows System Log shows Bad Block errors (Figure 2).
Figure 2: Bad Block error shown in the Windows System Log
The RAID controller log (TTYLOG) shows errors like:
02/26/15 13:43:39: EVT#131878-02/26/15 13:43:39: 97=Puncturing bad block on PD XX(e0x20/s2) at 180ca4a1f
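When a TTYLOG is long, it can help to pull out only the puncture events. The following is a minimal sketch in Python, assuming that every puncture entry follows the layout of the sample line above. The pattern, the function name `parse_puncture`, and the reading of `e0x20/s2` as enclosure and slot are assumptions made for illustration. Other controller firmware versions may format the entry differently.

```python
import re

# Pattern inferred from the single sample line in this article (assumption).
PUNCTURE_RE = re.compile(
    r"EVT#(?P<event>\d+)-\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}: "
    r"\d+=Puncturing bad block on PD (?P<pd>[^\s(]+)"
    r"\(e(?P<enclosure>0x[0-9a-fA-F]+)/s(?P<slot>\d+)\) "
    r"at (?P<lba>[0-9a-fA-F]+)"
)


def parse_puncture(line):
    """Return the fields of a puncture event, or None if the line is not one."""
    match = PUNCTURE_RE.search(line)
    if match is None:
        return None
    return {
        "event": int(match.group("event")),
        "pd": match.group("pd"),
        # Assumed meaning: enclosure ID (hex) and slot number.
        "enclosure": int(match.group("enclosure"), 16),
        "slot": int(match.group("slot")),
        # The block address in the sample contains hex digits, so parse as hex.
        "lba": int(match.group("lba"), 16),
    }


sample = ("02/26/15 13:43:39: EVT#131878-02/26/15 13:43:39: "
          "97=Puncturing bad block on PD XX(e0x20/s2) at 180ca4a1f")
event = parse_puncture(sample)
print(event["slot"], hex(event["lba"]))
```

Grouping the results by slot shows which physical disks the punctured blocks belong to.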
Find more information about obtaining these logs in our article about gathering logs.
RAID arrays are not immune to data errors. RAID controller and hard drive firmware contain functionality to detect and correct many types of data errors before they are written to an array/drive. Using outdated firmware can result in incorrect data being written to an array/drive because it is missing the error handling/error correction features available in the latest firmware versions.
Data errors can also be caused by physical bad blocks. For example, this can occur when the read/write head impacts the spinning platter (known as a "Head Crash"). Blocks can also become bad over time due to the degradation of the platter's ability to magnetically store bits in a specific location. Bad blocks caused by platter degradation often can be successfully read. Such a bad block may only be detected intermittently or with extended diagnostics on the drives.
A bad block, also known as a bad Logical Block Address (LBA), can also be caused by logical data errors. This occurs when data is written incorrectly to a drive even though it is reported as a successful write. Additionally, good data stored on a drive can be changed inadvertently. One example is a "bit flip", which can occur when the read/write head passes over or writes to a nearby location and causes data, in the form of zeros and ones, to change to a different value. Such a condition causes the "consistency" of the data to become corrupted. The value of the data on a specific block is different than the original data and may no longer match the checksum of the data. The physical LBA is good and can be written to successfully, but it currently contains incorrect data and may be interpreted as a bad block.
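The effect of a single bit flip on a checksum can be illustrated with a short Python sketch. CRC32 from the standard library is used here only as a stand-in. The error detection actually used by drives and PERC controllers is not described in this article, and the 512-byte block size and the helper `flip_bit` are assumptions chosen for the example.

```python
import zlib


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with exactly one bit inverted."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# One illustrative 512-byte block and the checksum stored when it was written.
original = bytes(512)
stored_checksum = zlib.crc32(original)

# The location stays writable, but one bit of its content has changed.
corrupted = flip_bit(original, 1000)

# The data no longer matches the checksum recorded for it.
checksum_matches = zlib.crc32(corrupted) == stored_checksum
print(checksum_matches)
```

The block itself is still physically good, which is why such a location can be rewritten with correct data once the inconsistency has been detected.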
For more information, read our article about Double Faults and Punctures in RAID Arrays.
Create a validated data backup on file level
Ensure that all failed drives showing predictive failures are replaced
Delete and recreate the Virtual Disk
Perform a Full Initialization of the VD
Perform a Check Consistency on the newly created VD
The data can now be restored to the healthy VD
Recommendation: Upgrade the firmware of all hard disks to the latest version
OMSA provides the ability to clear the bad block warnings. To clear bad blocks, the following procedure is recommended:
When performing a backup of the virtual disk with the Verify option selected, two scenarios can occur:
Run Patrol Read (under Virtual Disk Tasks in OMSA) and check the system event log to ensure that no new bad blocks are found. If bad blocks still exist, proceed to the next step. If not, the condition is cleared.
To clear these bad blocks, execute the Clear Virtual Disk Bad Blocks task. This can be done in the OMSA GUI or with the following CLI command:
omconfig storage vdisk action=clearvdbadblocks controller=id vdisk=id
To obtain the id values for the controller and the virtual disk, type
omreport storage controller
to display the controller IDs, and then type
omreport storage vdisk controller=ID
to display the IDs for the virtual disks
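Once both IDs are known, the clear command can be assembled programmatically, for example when the task has to be repeated on several systems. The following Python sketch only builds the argument list for the omconfig command shown above. The function name `clear_bad_blocks_command` is hypothetical, and the IDs 0 and 1 are placeholder values, not output from a real system.

```python
from typing import List


def clear_bad_blocks_command(controller_id: int, vdisk_id: int) -> List[str]:
    """Build the omconfig call that clears bad blocks on one virtual disk."""
    if controller_id < 0 or vdisk_id < 0:
        raise ValueError("IDs reported by omreport are non-negative integers")
    return [
        "omconfig", "storage", "vdisk",
        "action=clearvdbadblocks",
        f"controller={controller_id}",
        f"vdisk={vdisk_id}",
    ]


# Placeholder IDs; take the real values from the omreport output.
command = clear_bad_blocks_command(0, 1)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a server where OMSA is installed. Printing the command first allows it to be checked before it is executed.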