This article describes data errors, double faults, and punctures in a RAID array. It also provides recommendations to prevent or mitigate these problems, and steps to resolve issues after they have occurred.
Table of contents
- Data Errors and Double Faults
- Punctures: What Are They and How Are They Caused?
- Preventing Problems Before They Happen and Solving Punctures After They Occur
Chapter 1: Data Errors and Double Faults
RAID arrays are not immune to data errors. RAID controller and hard drive firmware contain functionality to detect and correct many types of data errors before they are written to an array/drive. Using outdated firmware can result in incorrect data being written to an array/drive because it is missing the error handling/error correction features available in the latest firmware versions.
Data errors can also be caused by physical bad blocks. For example, this can occur when the read/write head impacts the spinning platter (known as a "head crash"). Blocks can also become bad over time as the platter's ability to magnetically store bits in a specific location degrades. Bad blocks caused by platter degradation can often still be read successfully; such a block may only be detected intermittently or by extended diagnostics on the drive.
A bad block, also known as a bad Logical Block Address (LBA), can also be caused by logical data errors. This occurs when data is written incorrectly to a drive even though the write is reported as successful. Additionally, good data stored on a drive can be changed inadvertently. One example is a "bit flip", which can occur when the read/write head passes over or writes to a nearby location and causes data, in the form of zeros and ones, to change to a different value. Such a condition corrupts the "consistency" of the data: the value of the data in a specific block differs from the original data and may no longer match the checksum of the data. The physical LBA is good and can be written to successfully, but it currently contains incorrect data and may be interpreted as a bad block.
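As a toy illustration of such a logical error, the sketch below stores a block alongside a CRC32 checksum and then flips a single bit: the block remains perfectly readable, but it no longer matches its checksum and would be treated as bad. The `write_block`/`verify_block` helpers are hypothetical names used only for this illustration.

```python
import zlib

def write_block(data: bytes) -> dict:
    """Store a block's contents along with a CRC32 checksum (hypothetical helper)."""
    return {"data": bytearray(data), "crc": zlib.crc32(data)}

def verify_block(block: dict) -> bool:
    """A block whose contents no longer match its checksum reads as 'bad',
    even though the physical LBA is still perfectly writable."""
    return zlib.crc32(bytes(block["data"])) == block["crc"]

block = write_block(b"original payload")
assert verify_block(block)        # consistent after a successful write

block["data"][0] ^= 0x01          # simulate a single bit flip from a nearby write
assert not verify_block(block)    # still readable, but no longer consistent
```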
Bad LBAs are commonly reported as Sense Code 3/11/00. Sense Key 3 is a Medium Error. The additional sense code and additional sense code qualifier of 11/00 is defined as an Unrecovered Read Error. No attempt is made to correct the block, and no determination is made as to whether the bad block is the result of a physical defect on the drive platter or an error in the data due to other causes. The existence of a Sense Code 3/11/00 does not automatically mean that the physical drive has failed or that it should be replaced.
Dell hardware-based RAID controllers offer features such as Patrol Read and Check Consistency to correct many data error scenarios. Patrol Read operates by default as an automated background task that checks all of the individual blocks on a hard drive to ensure that the data can be read correctly. Patrol Read attempts to correct bad blocks or remap uncorrectable blocks to reserved blocks. Check Consistency is a manually activated (it can also be scheduled) function that compares all of the drives in an array against each other to ensure that the data and redundancy match correctly. For example, the three drives in a RAID 5 array are compared to ensure that the data and the parity use the correct values. If a single error is detected, the remaining data and/or parity is used to rewrite and correct the bad value. Similarly, in a RAID 1 array, the data on one drive is compared to the data on the other drive to ensure that it is mirrored correctly.
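As a rough sketch, the parity comparison that Check Consistency performs on a RAID 5 stripe can be modeled in a few lines of Python. The stripe layout, the two-byte segments, and the assumption that the data segments are the good copies are simplifications for illustration only; a real controller applies its own logic in firmware.

```python
from functools import reduce

def xor(*segments: bytes) -> bytes:
    """Byte-wise XOR of equal-length segments (how RAID 5 parity is formed)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*segments))

def check_consistency(stripe: dict) -> bool:
    """Compare the data segments against the parity segment, as Check
    Consistency does. Returns True if the stripe is consistent; otherwise
    rewrites the parity. (Illustrative: we assume the data copies are good.)"""
    expected = xor(*stripe["data"])
    if stripe["parity"] == expected:
        return True
    stripe["parity"] = expected      # correct the single error
    return False

stripe = {"data": [b"\x11\x22", b"\x33\x44"], "parity": b"\x22\x66"}
assert check_consistency(stripe)     # parity matches the XOR of the data

stripe["parity"] = b"\x00\x00"       # inject a single parity error
assert not check_consistency(stripe) # mismatch detected and corrected in place
assert stripe["parity"] == b"\x22\x66"
```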
Any single error in a RAID array, if uncorrected, may lead to more serious errors in the array, especially when a second error occurs. One or more single errors will not cause data loss as long as the array remains in an optimal state: there is still sufficient data plus redundancy to operate normally while the array is optimal.
Because the controller can correct errors during normal operation, it is not always easy to detect when underlying problems in the data exist. The controller log, hardware logs, and operating system event logs rarely contain errors or alerts when one or more single-error conditions exist. For this reason, an array can appear to operate normally for a very long time, despite the presence of consistency errors and/or single errors.
Figure 1: Multiple Single Faults in a RAID 5 Array - Optimal Array
As shown in Figure 1, the array has multiple errors. However, since there is only a single error in any stripe, the controller can still access all of the data because of the redundancy of RAID 5. If the error occurs on the parity segment, all data is intact and the error has no impact on read operations. If the error occurs in a data segment, an XOR comparison between the good data and the good parity pieces recalculates the missing/bad data segment. In either case, since there is only a single error in any stripe, there is sufficient redundancy available to access all of the data successfully.
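The XOR recalculation described above can be sketched as follows (a minimal illustration, not controller code): XOR of the surviving data segment and the parity segment reproduces the missing data segment.

```python
from functools import reduce

def xor(*segs: bytes) -> bytes:
    """Byte-wise XOR of equal-length segments."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*segs))

# A RAID 5 stripe across three drives: two data segments and one parity segment
d0, d1 = b"\xAA\xBB", b"\x0F\xF0"
parity = xor(d0, d1)              # parity = d0 XOR d1

# If the segment on one drive is unreadable, XOR of the survivors recovers it
recovered_d1 = xor(d0, parity)
assert recovered_d1 == d1
```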
When one or more drives in a RAID array contain data errors, and another drive in the array is no longer an active member of the array due to drive failure, foreign configuration, drive removal, or any other reason, this creates a condition known as a "double fault". A double fault condition results in the immediate loss of any data in the impacted stripes.
Figure 2: Double Fault with a Failed Drive (Data in Stripes 1 and 2 is lost) - Degraded Array
It is possible for a double fault condition to occur while the array remains in an optimal state. This would require identical bad LBAs on multiple hard drives. Such a condition is extremely rare, given the sheer number of LBAs on today's larger hard drives; it is very unlikely for the same LBA on multiple hard drives to be "bad" at the same time.
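The rarity of that coincidence can be put into rough numbers. The figures below (a 1 TB drive, 512-byte sectors, ten bad LBAs per drive at independent random positions) are illustrative assumptions, not data from this article:

```python
# A 1 TB drive with 512-byte sectors has roughly 2 billion LBAs
lbas_per_drive = 1_000_000_000_000 // 512    # = 1,953,125,000

# Hypothetically, each of two drives has 10 bad LBAs at independent,
# uniformly random positions
bad_per_drive = 10

# Expected number of LBA addresses that are bad on BOTH drives at once
expected_collisions = bad_per_drive * bad_per_drive / lbas_per_drive
print(f"{expected_collisions:.2e}")          # → 5.12e-08
```

In other words, the expected overlap is on the order of one in twenty million, which is why identical bad LBAs on multiple drives in an optimal array are so rarely seen in practice.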
Performing regular Check Consistency operations will correct for single faults, whether a physical bad block or a logical error of the data. Check Consistency will also mitigate the risk of a double fault condition in the event of additional errors. When there is no more than a single error in any given stripe, a Check Consistency can almost always eliminate the error.
Back to Top
Chapter 2: Punctures: What Are They and How Are They Caused?
A puncture is a feature of Dell PERC controllers designed to allow the controller to restore the redundancy of the array despite the loss of data caused by a double fault condition. Another name for a puncture is "rebuild with errors". The RAID controller detects the double fault and, because there is insufficient redundancy to recover the data in the impacted stripe, creates a puncture in that stripe and allows the rebuild to continue.
- Any condition that causes data to be inaccessible in the same stripe on more than one drive is a double fault
- Double faults cause the loss of all data within the impacted stripe
- All punctures are double faults, but not all double faults are punctures
Figure 3: Punctured Stripes (Data in Stripes 1 and 2 is lost due to the double fault condition) - Optimal Array
Without the puncture feature, the array rebuild would fail and leave the array in a degraded state. In some cases, the failures may cause additional drives to fail and leave the array in a non-functioning offline state. Puncturing an array has no impact on the ability to boot to or access any data on the array; any damage or data loss due to the double fault condition has already occurred.
Punctures can occur in one of two situations:
- Double Fault already exists (Data already lost)
- Data error on an online drive is propagated (copied) to a rebuilding drive
- Double Fault does not exist (Data is lost when second error occurs)
- While in a degraded state, if a bad block occurs on an online drive, that LBA is punctured
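The two situations above can be modeled with a toy rebuild loop: whenever a surviving segment in a stripe is unreadable, this simplified "controller" punctures the stripe and continues rather than failing the whole rebuild. This is a sketch of the concept only; the real behavior lives in PERC firmware.

```python
BAD = "BAD"  # marker for an unreadable segment (a bad LBA)

def rebuild(stripes, failed_idx):
    """Rebuild the failed drive's segment in each stripe. If a surviving
    segment is unreadable there is insufficient redundancy, so the stripe
    is punctured (the error is propagated to the rebuilt drive) and the
    rebuild carries on."""
    punctured = []
    for n, stripe in enumerate(stripes):
        survivors = [seg for i, seg in enumerate(stripe) if i != failed_idx]
        if BAD in survivors:
            stripe[failed_idx] = BAD        # error propagated to rebuilt drive
            punctured.append(n)
        else:
            stripe[failed_idx] = "rebuilt"  # XOR of survivors in a real array
    return punctured

# Three stripes across three drives; drive 0 has failed (segments unknown)
# and drive 1 carries a pre-existing bad block in stripe 1
stripes = [
    [None, "d1", "d2"],
    [None, BAD,  "d2"],
    [None, "d1", "d2"],
]
result = rebuild(stripes, failed_idx=0)
print(result)  # → [1]: only stripe 1 is punctured; the others rebuild normally
```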
The advantage of puncturing an array is that the system remains available in production and the redundancy of the array is restored. The data in the affected stripe was lost whether or not the puncture occurred. The primary disadvantage of the LSI method is that, while the array contains a puncture, uncorrectable errors will continue to be encountered whenever the impacted data (if any) is accessed.
Punctures can occur in any of three locations. First, a puncture can occur in blank space that contains no data. That stripe will be inaccessible, but since there is no data in that location, there is no significant impact. Any attempt by an OS to write to a punctured stripe will fail, and the data will be written to a different location.
Second, a puncture may occur in a stripe that contains non-critical data, such as a README.TXT file. If the impacted data is not accessed, no errors are generated during normal I/O. However, attempts to perform a file system backup will fail to back up any files impacted by a puncture, and Check Consistency or Patrol Read operations will generate Sense Code 3/11/00 for the applicable LBAs and/or stripes.
Third, a puncture may occur in data space that is accessed. In such a case, the lost data can cause a variety of errors: minor errors that do not adversely impact a production environment, or more severe errors that prevent the system from booting to an operating system or cause applications to fail.
An array that is punctured must eventually be deleted and recreated to eliminate the puncture. This procedure erases all data, which must then be recreated or restored from backup after the puncture is eliminated. The resolution of a puncture can be scheduled for a time that is most advantageous to the needs of the business.
If the data within a punctured stripe is accessed, errors will continue to be reported against the affected bad LBAs with no correction possible. Eventually (this could be minutes, days, weeks, or months), the Bad Block Management (BBM) table fills up, causing one or more drives to be flagged as predictive failure. Referring to Figure 3, drive 0 will typically be the drive flagged as predictive failure, because the errors on drive 1 and drive 2 are propagated to it. Drive 0 may actually be working normally, and replacing drive 0 will only cause the replacement to eventually be flagged predictive failure as well.
A Check Consistency performed after a puncture is induced will not resolve the issue. This is why it is very important to perform a Check Consistency on a regular basis, and especially, when possible, prior to replacing drives. The array must be in an optimal state to perform a Check Consistency.
A RAID array that contains a single data error, in conjunction with an additional error event such as a hard drive failure, causes a puncture when the failed or replacement drive is rebuilt into the array. As an example, an optimal RAID 5 array includes three members: drive 0, drive 1, and drive 2. If drive 0 fails (Figure 2) and is replaced, the data and parity remaining on drives 1 and 2 are used to rebuild the missing information onto the replacement drive 0. However, if a data error exists on drive 1 when the rebuild operation reaches it, there is insufficient information within the stripe to rebuild the missing data. During the rebuild, drive 0 has no data yet, drive 1 has bad data, and drive 2 has good data; with multiple errors within the stripe, the data in that stripe cannot be recovered and is therefore lost. The result, as shown in Figure 3, is that punctures (in stripes 1 and 2) are created during the rebuild, and the errors are propagated to drive 0.
Puncturing the array restores the redundancy and returns the array to an optimal state. This provides for the array to be protected from additional data loss in the event of additional errors or drive failures.
Back to Top
Chapter 3: Preventing Problems Before They Happen & Solving Punctures After They Occur
It can be tempting to operate under the premise, "If it isn't broke, don't fix it." While this may hold true in many areas, storage subsystems are best protected and managed through routine, regular maintenance. Proactive maintenance can correct existing errors and prevent some errors from occurring. It is not possible to prevent all errors, but most serious errors can be mitigated significantly with proactive maintenance. For storage and RAID subsystems, these steps are:
- Update drivers and firmware on controllers, hard drives, backplanes and other devices
- Perform routine Check Consistency operations
- Review logs for indications of problems
This does not have to be a high-level technical review; a cursory look through the logs for obvious indications of potential problems is sufficient
- Contact Dell Technical Support with any questions or concerns
One of the most critical steps is to ensure that firmware is kept up to date. Firmware is where all of the logic for the operation of a device resides. It provides the functionality and features of the device, along with a variety of error handling and error correcting functions. Keeping firmware current can provide better performance and fewer errors. New features and enhancements can also be added via a firmware update.
Firmware can reside in a number of places. RAID controllers contain firmware, as does each of the individual hard drives installed in a system or array. Backplanes and external enclosures also contain firmware that can impact the operation of the drives and arrays contained within them.
Another proactive maintenance recommendation is to perform a Check Consistency. Check Consistency is a manual operation because it consumes a portion of the overall available bandwidth of the RAID controller; however, it can be scheduled for a time when it has the least impact on performance.
A Check Consistency will check for bad blocks on the drives, but more importantly it will compare the data in the array to ensure that all of the pieces match up correctly. When it finds a problem, it determines what the data should be by checking the data on the other drives in the array, and corrects it. Correcting data errors while they are relatively few is the best way to mitigate the risk of punctures caused by an existing data error in conjunction with a second error or failure. Double faults and punctures can cause the loss of productivity for the time necessary to restore the array(s) and data to a functioning state, or even the complete loss of all data.
When a double fault or puncture condition exists, there is often some data loss. If the location of these errors is in blank space or non-critical data space, the immediate impact on a production environment is relatively small. However, the presence of these errors can mean that more serious problems exist. Hardware errors and outdated firmware may require immediate attention.
If a known or suspected double fault or puncture condition exists follow these steps to minimize the risk of more severe problems:
- Perform a Check Consistency (array must be optimal)
- Determine if hardware problems exist
- Check the controller log
- Perform hardware diagnostics
- Contact Dell Technical Support as needed
Even after these steps have been completed, additional concerns remain. Punctures can cause hard drives to go into a predictive failure status over time. Data errors that are propagated to a drive are reported as media errors on that drive, even though no actual hardware problem exists. Each time the LBA is accessed, an error is reported. Once the error log is full, the drive reports itself as predictive failure.
A single punctured LBA on a drive can be reported many times. Depending on the number of punctures, it is possible for multiple drives in an array to be reported as predictive failure. Replacing the predictive failure drive will cause the existing punctures to be re-propagated to the replacement drive, which will eventually cause the replacement drive to also be flagged predictive failure. In such a case, the only corrective action is to resolve the puncture condition.
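The way repeated reports against punctured LBAs eventually trip a drive's predictive failure flag can be sketched as follows. The log capacity and the `Drive` class are hypothetical; real drives use vendor-specific thresholds and firmware logic.

```python
LOG_CAPACITY = 8  # hypothetical drive error-log size, for illustration only

class Drive:
    """Flags predictive failure once its error log fills, even though the
    underlying hardware may be healthy and the errors merely propagated
    punctures."""
    def __init__(self):
        self.error_log = []

    def access(self, lba, punctured_lbas):
        if lba in punctured_lbas:
            self.error_log.append(lba)   # the same LBA can be logged repeatedly

    @property
    def predictive_failure(self):
        return len(self.error_log) >= LOG_CAPACITY

d = Drive()
for _ in range(10):                      # repeated reads of one punctured LBA
    d.access(1234, punctured_lbas={1234})
print(d.predictive_failure)  # → True
```

Note that replacing this "failing" drive does not help: the replacement inherits the same punctured LBAs and eventually trips the same threshold, as the article describes.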
Looking at Figure 3, we can see that there is a puncture in stripes 1 and 2. Replacing hard drives will not resolve this issue, because there is insufficient data redundancy to rebuild the original data. Any data contained in the punctured stripes is lost (unless retained in a previous backup). Remember that a puncture does not cause data loss; the double fault condition results in data loss. A puncture is the means to restore redundancy to an array that contains a double fault.
Note: The following process resolves most punctures. It may not be necessary to perform all of these steps to resolve the issue. If following these steps does not resolve an issue, contact Dell Technical Support for further assistance.
Warning: Following these steps will result in the loss of all data on the array. Please ensure you are prepared to restore from backup or other means prior to following these steps. Use caution so that following these steps does not impact any other arrays.
- Discard Preserved Cache (if it exists)
- Clear foreign configurations (if any)
- Delete the array
- Shift the position of the drives by one (using Figure 1, move Disk 0 to slot 1, Disk 1 to slot 2, and Disk 2 to slot 0)
- Recreate the array as desired
- Perform a Full Initialization of the array (not a Fast Initialization)
- Perform a Check Consistency on the array
If the check consistency completes without errors, you can safely assume that the array is now healthy and the puncture is removed. Data can now be restored to the healthy array.
In more severe cases, the issue may not be resolved and errors can persist despite following these steps. If following these steps does not resolve an issue, contact Dell Technical Support for further assistance.
It may be necessary to analyze the punctures in more detail to determine which drives are in common. For example, in Figure 3, the controller log would show a puncture between Disks 0 and 1, and a puncture between Disks 0 and 2; Disk 0 is the common drive. Follow the same steps above, but completely remove the common drive first. Using the example in Figure 3, remove Disk 0 and then follow the steps outlined, creating the array using the remaining disks (1 and 2). Once complete, and after a Check Consistency determines that the array is healthy, add Disk 0 back in and either perform the steps again with all of the drives, or use the RAID Level Migration (RLM) and/or Online Capacity Expansion (OCE) features to add the remaining drive(s) back into the array.
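Finding the drive common to multiple punctures, as described above, amounts to counting drive occurrences across the puncture records. A minimal sketch, assuming the controller log has already been reduced to the drive pair involved in each punctured stripe (the pairs below mirror the Figure 3 example and are purely illustrative):

```python
from collections import Counter

# Drive pairs per punctured stripe, as read from the controller log
punctures = [(0, 1), (0, 2)]

counts = Counter(drive for pair in punctures for drive in pair)
common = counts.most_common(1)[0][0]
print(common)  # → 0 (Disk 0 appears in every puncture record)
```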
Any drives that are flagged predictive failure should be removed and not included in the recovery process. Again using Figure 3 as an example, if Disk 0 is predictive failure, remove this drive and then perform the steps as outlined above. Since only two drives remain, the RAID array created is a RAID 1 instead of a RAID 5. After obtaining a replacement for Disk 0 (due to the predictive failure), perform the steps again including all three drives, or use RLM to add Disk 0 into the existing array, changing it from a two-drive RAID 1 into a three-drive RAID 5.
The process can be daunting, especially considering the potential for data loss. The adage, "An ounce of prevention is worth a pound of cure" is certainly true here. Experience has shown that almost all double fault and puncture conditions could have been avoided by performing proactive maintenance on RAID hardware and arrays.
Note: Effectively monitoring the system allows problems to be detected and corrected in a timely manner which also reduces the risk of more serious problems.
Back to Top