PowerEdge: How to fix Double Faults and Punctures in RAID Arrays

Summary: This article provides information about Double Faults and Punctures in a RAID array and it also advises how to fix the problem.

Instructions

Fixing double faults and RAID punctures
Data Errors and Double Faults
Punctures: What Are They and How Are They Caused?
Preventing Problems Before They Happen and Solving Punctures After They Occur

Chapter 1: Fixing double faults and RAID punctures

Warning: Following these steps results in the loss of all data on the array.
Ensure you are prepared to restore from a file backup prior to following these steps.
Use caution so that following these steps does not impact any other arrays.

Discard preserved Cache (if it exists)
Clear foreign configurations (if any)
Delete the array
Check for any failed drives
Reseat any failed drives
Clear any foreign configuration again
Replace all failed drives including predictive failed drives
Update the firmware (Controller, backplane (BP), drives) if needed
Create the array
Perform a Full Initialization (not a Fast Initialization)
At this stage, the array should be ready to be used

Note: If any drives need replacement, contact Dell support (see "How to contact Dell EMC Enterprise Support and open an Online Case?").

Chapter 2: Data Errors and Double Faults

RAID arrays are not immune to data errors. RAID controller and hard drive firmware contain functionality to detect and correct many types of data errors before they are written to an array/drive. Using outdated firmware can result in incorrect data being written to an array/drive because it is missing the error handling/error correction features available in the latest firmware versions.

Data errors can also be caused by physical bad blocks. For example, this can occur when the read/write head impacts the spinning platter (known as a "Head Crash"). Blocks can also become bad over time due to the degradation of the platter's ability to magnetically store bits in a specific location. Bad blocks caused by platter degradation often can be successfully read. Such a bad block may only be detected intermittently or with extended diagnostics on the drives.

A bad block, also known as a bad Logical Block Address (LBA), can also be caused by logical data errors. This occurs when data is written incorrectly to a drive even though it is reported as a successful write. Also, good data stored on a drive can be changed inadvertently. One example is a "bit flip," which can occur when the read/write head passes over or writes to a nearby location and causes data, in the form of zeros and ones, to change to a different value. Such a condition causes the "consistency" of the data to become corrupted. The value of the data on a specific block is different than the original data and may no longer match the checksum of the data. The physical LBA is good and can be written to successfully, but it contains incorrect data and may be interpreted as a bad block.

Bad LBAs are commonly reported as Sense Code 3/11/00. Sense Key 3 is a Medium Error. The additional sense code and additional sense code qualifier of 11/00 are defined as an Unrecovered Read Error. No attempt is made to correct the block, and no determination is made whether the bad block is the result of a physical defect on the drive platter or a data error due to other causes. The existence of a Sense Code 3/11/00 does not automatically mean that the physical drive has failed or that it should be replaced.
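As an illustration of how the key/code/qualifier triplet described above is read, here is a minimal Python sketch. The function and table names are hypothetical, and the lookup tables contain only the two values named in this article; real drives report many other codes.

```python
from typing import NamedTuple

# Only the values named in this article are listed here (assumption:
# anything else is simply reported as unknown by this sketch).
SENSE_KEYS = {0x3: "Medium Error"}
ADDITIONAL_SENSE = {(0x11, 0x00): "Unrecovered Read Error"}


class SenseCode(NamedTuple):
    key: int   # sense key
    asc: int   # additional sense code
    ascq: int  # additional sense code qualifier


def parse_sense_code(text: str) -> SenseCode:
    """Parse 'key/asc/ascq' written in hexadecimal, for example '3/11/00'."""
    parts = text.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected key/asc/ascq, got {text!r}")
    key, asc, ascq = (int(part, 16) for part in parts)
    return SenseCode(key, asc, ascq)


def describe(code: SenseCode) -> str:
    """Return a readable description using the small tables above."""
    key_name = SENSE_KEYS.get(code.key, "unknown sense key")
    detail = ADDITIONAL_SENSE.get((code.asc, code.ascq), "unknown additional sense")
    return f"{key_name}: {detail}"


def is_bad_lba_report(code: SenseCode) -> bool:
    """True for the 3/11/00 combination this article associates with bad LBAs."""
    return code == SenseCode(0x3, 0x11, 0x00)
```

Calling describe(parse_sense_code("3/11/00")) yields the two descriptions joined together. A match only identifies the report type; as noted above, it does not by itself indicate that the drive should be replaced.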

Dell hardware-based RAID controllers offer features such as Patrol Read and Check Consistency to correct many data error scenarios. Patrol Read operates by default as an automated background task that checks all the individual blocks on a hard drive to ensure that the data can be read correctly. Patrol Read attempts to correct blocks that are bad or remap uncorrectable blocks to reserved blocks. Check Consistency is a manually activated (it can also be scheduled) function that compares all the drives in an array against each other to ensure that the data and redundancy correctly match. For example, three drives in a RAID 5 array will be compared to ensure that the data and the parity are using the correct values. If a single error is detected, the remaining data and/or parity will be used to re-write and correct the bad value. Similarly, in a RAID 1 array, the data on one drive will be compared to the other drive to ensure that the data is mirrored correctly.
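The comparison that Check Consistency performs can be pictured with a short Python sketch. This is a conceptual model only, not the controller's implementation: it assumes single-parity RAID 5 where the parity segment is the XOR of the data segments, and all function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_stripe_is_consistent(segments: list[bytes]) -> bool:
    """Data segments XOR parity segment must come out as all zero bytes."""
    return not any(xor_blocks(segments))


def raid1_stripe_is_consistent(copy_a: bytes, copy_b: bytes) -> bool:
    """Both halves of a mirror must hold identical data."""
    return copy_a == copy_b


def inconsistent_stripes(stripes: list[list[bytes]]) -> list[int]:
    """Return the indexes of RAID 5 stripes whose data and parity disagree."""
    return [
        index
        for index, segments in enumerate(stripes)
        if not raid5_stripe_is_consistent(segments)
    ]


# Example: two data segments plus parity per stripe (three-drive RAID 5).
d0, d1 = b"\x0f\xf0", b"\x33\x55"
parity = xor_blocks([d0, d1])
good_stripe = [d0, d1, parity]
bad_stripe = [d0, b"\x33\x54", parity]  # one bit differs on the second drive
```

Here inconsistent_stripes([good_stripe, bad_stripe]) flags only the second stripe. The sketch only detects the mismatch; deciding which segment to rewrite is left to the controller.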

Any single error in a RAID array, if uncorrected, may cause more serious errors in the array, especially when a second error occurs. One or more single errors will not cause loss of data if the array remains in an optimal state. There is still sufficient data plus redundancy to operate normally while the array is optimal.

Because the controller can correct for errors during normal operations, it is not always easy to detect when underlying problems in the data exist. There are rarely any errors or alerts in the controller log, hardware logs, or operating system event logs when one or more single-error conditions exist. For this reason, an array can appear to be operating normally for a long time, despite the presence of consistency errors and/or single errors.

Figure 1: Multiple Single Faults in a RAID 5 array - Optimal Array

As shown in Figure 1, the array has multiple errors. However, since there is only a single error in any stripe, the controller can still access all the data because of the redundancy of RAID 5. If the error occurs on the parity segment, all data is intact and the error has no impact on read operations. If the error occurs in a data segment, an XOR comparison must occur between the good data and the good parity pieces to recalculate the missing/bad data segment. In either case, since there is only a single error in any stripe, there is sufficient redundancy available to access all the data successfully.
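The XOR recalculation described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: one stripe, single parity, and an unreadable segment represented by None. The names are hypothetical.

```python
from typing import Optional


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for index, byte in enumerate(block):
            result[index] ^= byte
    return bytes(result)


def rebuild_segment(segments: list[Optional[bytes]]) -> list[bytes]:
    """Recompute the one unreadable segment (None) from the others.

    Works the same way whether the missing piece is data or parity.
    Raises ValueError when more than one segment is unreadable, because
    single parity cannot recover that stripe.
    """
    missing = [i for i, segment in enumerate(segments) if segment is None]
    if not missing:
        return list(segments)
    if len(missing) > 1:
        raise ValueError("more than one unreadable segment in the stripe")
    good = [segment for segment in segments if segment is not None]
    repaired = list(segments)
    repaired[missing[0]] = xor_blocks(*good)
    return repaired
```

With d0 and d1 as data and parity = xor_blocks(d0, d1), calling rebuild_segment([None, d1, parity]) returns the original d0 in the first position. Passing two None entries raises the error, which corresponds to the double fault condition.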

When one or more drives in a RAID array contain data errors, and another drive in the array is no longer an active member of the array due to drive failure, foreign configuration, drive removal, or any other reason, this creates a condition known as a "Double Fault." A double fault condition results in the immediate loss of any data in the impacted stripes.

Figure 2: Double Fault with a Failed Drive (Data in Stripes 1 and 2 is lost) - Degraded Array.
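The difference between Figure 1 and Figure 2 can be expressed as a per-stripe count of unavailable segments. The Python sketch below assumes single-parity RAID 5, four stripes numbered 0 to 3, and block errors on drive 1 in stripe 1 and drive 2 in stripe 2; that layout is an assumption chosen to match the captions, and the function name is hypothetical.

```python
def classify_stripes(
    num_stripes: int,
    failed_drives: set[int],
    block_errors: set[tuple[int, int]],
) -> dict[int, str]:
    """Label each stripe by how many of its segments cannot be read.

    failed_drives: drives that are no longer active members of the array.
    block_errors: (drive, stripe) pairs for bad blocks on online drives.
    """
    status = {}
    for stripe in range(num_stripes):
        unreadable = set(failed_drives)
        unreadable |= {drive for drive, s in block_errors if s == stripe}
        if not unreadable:
            status[stripe] = "no fault"
        elif len(unreadable) == 1:
            status[stripe] = "single fault"   # still recoverable
        else:
            status[stripe] = "double fault"   # data in this stripe is lost
    return status


errors = {(1, 1), (2, 2)}
optimal = classify_stripes(4, set(), errors)   # all drives online
degraded = classify_stripes(4, {0}, errors)    # drive 0 has failed
```

In the optimal case stripes 1 and 2 are single faults. Once drive 0 fails, those same stripes become double faults, while stripes 0 and 3 can still be served from the two remaining drives.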

It is possible for a double fault condition to occur while the array remains in an optimal state. This would occur with identically bad LBAs on multiple hard drives. Such a condition would be rare, given the sheer number of LBAs on today's larger hard drives. It would be unlikely for the same LBA on multiple hard drives to be "bad" simultaneously.
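A rough calculation shows why matching bad LBAs are unlikely. The Python sketch below assumes that bad LBAs are spread uniformly and independently across each drive, which is a simplification (real media defects can cluster), and the drive size and defect count in the example are illustrative values, not figures from this article.

```python
def probability_shared_bad_lba(total_lbas: int, bad_per_drive: int) -> float:
    """Chance that two drives share at least one bad LBA.

    Assumes each drive has bad_per_drive bad LBAs chosen uniformly at
    random from total_lbas addresses, independently of the other drive.
    """
    no_overlap = 1.0
    for i in range(bad_per_drive):
        # Each bad LBA on the second drive must avoid the first drive's set.
        no_overlap *= (total_lbas - bad_per_drive - i) / (total_lbas - i)
    return 1.0 - no_overlap


# Illustrative values: a 4 TB drive with 512-byte sectors, 100 bad LBAs each.
example = probability_shared_bad_lba(4_000_000_000_000 // 512, 100)
```

Under these assumptions the result is on the order of one in a million, which is consistent with the statement above that this form of double fault is rare.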

Performing regular Check Consistency operations will correct for single faults, whether a physical bad block or a logical error of the data. Check Consistency will also mitigate the risk of a double fault condition in the event of additional errors. When there is no more than a single error in any given stripe, a Check Consistency can almost always eliminate the error.

Chapter 3: Punctures: What Are They and How Are They Caused?

A puncture is a feature of Dell's PERC controllers designed to allow the controller to restore the redundancy of the array despite the loss of data caused by a double fault condition. Another name for a puncture is "rebuild with errors." The RAID controller detects a double fault, and because there is insufficient redundancy to recover the data in the impacted stripe, the controller creates a puncture in that stripe and allows the rebuild to continue.

Any condition that causes data to be inaccessible in the same stripe on more than one drive is a double fault
Double faults cause the loss of all data within the impacted stripe
All punctures are double faults but all double faults are NOT punctures

Figure 3: Punctured Stripes (Data in Stripes 1 and 2 is lost due to double fault condition) - Optimal array.

Without the puncture feature, the array rebuild would fail and leave the array in a degraded state. Sometimes, the failures may cause additional drives to fail and put the array in a non-functioning offline state. Puncturing an array has no impact on the ability to boot to or access any data on the array. Any damage or data loss due to a double fault condition has already occurred.

Punctures can occur in one of two situations:

Double Fault already exists (Data already lost)
- Data error on an online drive is propagated (copied) to a rebuilding drive

Double Fault does not exist (Data is lost when a second error occurs)
- While in a degraded state, if a bad block occurs on an online drive, that LBA is punctured

The advantage of puncturing an array is that the system remains available in production and the redundancy of the array is restored. The data in the affected stripe was lost whether the puncture occurs or not. The primary disadvantage of the LSI method is that while the array has a puncture in it, uncorrectable errors continue to be encountered whenever the impacted data (if any) is accessed.

A puncture can occur in three locations. First, a puncture can occur in a blank space that contains no data. That stripe is inaccessible, but since there is no data in that location, it has no significant impact. Any attempt by an OS to write to a punctured stripe fails, and the data is written to a different location.

Second, a puncture may occur in a stripe that contains data that is not critical, such as a README.TXT file. If the impacted data is not accessed, no errors are generated during normal I/O. Attempts to perform a file system backup fail to back up any files impacted by a puncture. Performing a Check Consistency or Patrol Read operation generates Sense Code 3/11/00 for the applicable LBA and/or stripes.

Third, a puncture may occur in a data space that is accessed. In such a case, the lost data can cause various errors. The errors can be minor errors that do not adversely impact a production environment. The errors can also be more severe and can prevent the system from booting to an operating system, or cause applications to fail.

An array that is punctured will eventually have to be deleted and re-created to eliminate the puncture. This procedure causes all data to be erased. The data would then need to be re-created or restored from backup after the puncture is eliminated. The resolution for a puncture can be scheduled for a time that is more advantageous to the needs of the business.

If the data within a punctured stripe is accessed, errors continue to be reported against the affected bad LBAs with no possible correction available. Eventually (this could be minutes, days, weeks, or months), the Bad Block Management (BBM) Table fills up, causing one or more drives to be marked as predictive failure. Referring to Figure 3, drive 0 will typically be the drive that gets marked as predictive failure because the errors on drive 1 and drive 2 are propagated to it. Drive 0 may be working correctly, and replacing drive 0 will only cause the replacement to eventually be marked predictive failure as well.

A Check Consistency performed after a puncture is induced will not resolve the issue. This is why it is important to perform a Check Consistency regularly. It becomes especially important prior to replacing drives, when possible. The array must be in an optimal state to perform the Check Consistency.

A RAID array that contains a single data error with an additional error event such as a hard drive failure causes a puncture when the failed or replacement drive is rebuilt into the array. As an example, an optimal RAID 5 array includes three members: Drive 0, drive 1 and drive 2. If drive 0 fails (Figure 2) and is replaced, the data and parity remaining on drives 1 and 2 are used to rebuild the missing information back on to the replacement drive 0. However, if a data error exists on drive 1 when the rebuild operation reaches that error, there is insufficient information within the stripe to rebuild the missing data in that stripe. Drive 0 has no data, drive 1 has bad data and drive 2 has good data as it is being rebuilt. There are multiple errors within that stripe. Drive 0 and drive 1 do not contain valid data, so any data in that stripe cannot be recovered and is lost. The result as shown in Figure 3 is that punctures (in stripes 1 and 2) are created during the rebuild. The errors are propagated to drive 0.
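The rebuild scenario above can be modeled in Python as follows. This is a conceptual sketch, not PERC firmware behavior in detail: it assumes single-parity RAID 5, one block per drive per stripe, and represents an unreadable block as None. All names are hypothetical.

```python
from functools import reduce
from typing import Optional

Block = Optional[bytes]  # None models an unreadable (bad) block


def xor_blocks(blocks) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_drive(surviving: list[list[Block]]) -> tuple[list[Block], list[int]]:
    """Rebuild a replacement drive from the surviving drives, stripe by stripe.

    Returns the rebuilt blocks and the list of punctured stripe numbers.
    A stripe is punctured when any surviving block in it is unreadable,
    because the missing block can then no longer be recomputed.
    """
    rebuilt: list[Block] = []
    punctured: list[int] = []
    for stripe, segments in enumerate(zip(*surviving)):
        if any(segment is None for segment in segments):
            rebuilt.append(None)       # error is propagated to the new drive
            punctured.append(stripe)   # rebuild continues with the next stripe
        else:
            rebuilt.append(xor_blocks(segments))
    return rebuilt, punctured


# Drive 0 was replaced; drive 1 has a bad block in stripe 1, drive 2 in stripe 2.
drive1 = [b"\x01", None, b"\x03", b"\x04"]
drive2 = [b"\x10", b"\x20", None, b"\x40"]
new_drive0, punctured_stripes = rebuild_drive([drive1, drive2])
```

In this example the rebuild completes for stripes 0 and 3 and records stripes 1 and 2 as punctured, matching the outcome described for Figure 3. Without the puncture step, the loop would have to stop at the first unreadable block and the array would remain degraded.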

Puncturing the array restores the redundancy and returns the array to an optimal state. This protects the array from additional data loss in the event of additional errors or drive failures.

Chapter 4: Preventing Problems Before They Happen and Solving Punctures After They Occur

It can be tempting to operate under the premise, "If it is not broke, do not fix it." While this may hold true in many areas, to best protect and manage storage subsystems, it is highly recommended to perform routine and regular maintenance. Proactive maintenance can correct existing errors and prevent some errors from occurring. It is not possible to prevent all errors from occurring, but most serious errors can be mitigated with proactive maintenance. For storage and RAID subsystems, these steps are:

Update drivers and firmware on controllers, hard drives, backplanes, and other devices
Perform routine Check Consistency operations
Review logs for indications of problems

This does not have to be a high-level technical review, but could simply be a cursory view of the logs looking for obvious indications of potential problems.
Contact Dell Technical Support with any questions or concerns

One of the most critical things that should be done is to ensure that firmware is kept up to date. Firmware is where all the logic for the operation of a device resides. It provides the functionality and the features of the device, along with various error handling and error correcting functions. Keeping firmware current can provide for better performance and fewer errors. New features and enhancements can also be added using a firmware update.

Firmware can reside in several places. RAID controllers contain firmware, as does each of the individual hard drives installed in a system or array. Backplanes and external enclosures also contain firmware that can impact the operation of the drives and arrays contained within them.

Another proactive maintenance recommendation is to perform a "Check Consistency." Check consistency is a manual operation because it does consume a portion of the overall available bandwidth of the RAID controller. However, the check consistency can be scheduled for a time when it has the least impact on performance.

Check consistency checks for bad blocks on the drives, but more importantly it compares the data in the array to be sure that all the pieces match up correctly. When it finds a problem, it determines what the data should look like and corrects it by checking the data on other drives in the array. Correcting data errors while they are relatively small is the best way to mitigate the risk of punctures caused by existing data errors combined with a second error or failure. The existence of double faults and punctures can cause the loss of productivity for the time necessary to restore the arrays and data to a functioning state, or even the complete loss of all data.

When a double fault or puncture condition exists, there is often some data loss. If the location of these errors is in blank space or non-critical data space, the immediate impact on data in a production environment is relatively small. However, the presence of these errors can mean that a more serious problem may exist. Hardware errors and outdated firmware may require immediate attention.

If a known or suspected double fault or puncture condition exists, follow these steps to minimize the risk of more severe problems:

Perform a Check Consistency (the array must be optimal)
Determine if hardware problems exist
Check the controller log
Perform hardware diagnostics
Contact Dell Technical Support as needed

If these steps have been done, there are additional concerns. Punctures can cause hard drives to go into a predictive failure status over time. Data errors that are propagated to a drive will be reported as media errors on the drive, even though no hardware problems exist. Each time the LBA is accessed an error is reported. Once the error log is full, the drive reports itself as predictive failure.

A single punctured LBA on a drive can be reported many times. Depending on the number of punctures, it is possible for multiple drives in an array to be reported as predictive failure. Replacing the predictive failure drive causes the existing punctures to be re-propagated to the replacement drive, which will eventually cause the replacement drive to also be marked predictive failure. In such a case, the only corrective action is to resolve the puncture condition.
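The way repeated reports against one punctured LBA can push a healthy drive into predictive failure can be pictured with a small Python model. The class name and the log capacity are hypothetical; this article does not state the actual size of the drive's error log.

```python
class DriveErrorLog:
    """Toy model of a drive error log with a fixed number of entries."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # hypothetical value, not a real drive limit
        self.entries = []

    def record_media_error(self, lba: int) -> None:
        """Log one media error; the same LBA may be logged repeatedly."""
        if len(self.entries) < self.capacity:
            self.entries.append(lba)

    @property
    def predictive_failure(self) -> bool:
        """The drive flags predictive failure once the log is full."""
        return len(self.entries) >= self.capacity


log = DriveErrorLog(capacity=4)
for _ in range(3):
    log.record_media_error(1000)   # the same punctured LBA is read three times
flagged_after_three = log.predictive_failure
log.record_media_error(1000)       # a fourth access fills the log
flagged_after_four = log.predictive_failure
```

In this model a single punctured LBA is enough to fill the log, which illustrates why swapping the drive does not help: the replacement receives the same punctured LBA and accumulates the same reports.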

Looking at Figure 3, we can see that there is a puncture on stripes 1 and 2. Replacing hard drives is not going to resolve this issue, because there is insufficient data redundancy to rebuild the original data. Any data contained in the punctured stripes is lost (unless retained in a previous backup). Remember that a puncture does not cause data loss; a double fault condition results in data loss. A puncture is the means to restore redundancy to an array that contains a double fault.

Note: The process used to resolve most punctures is the list of steps in Chapter 1. It may not be necessary to perform all these steps to resolve the issue. If following these steps does not resolve an issue, contact Dell Technical Support for further assistance.

If the check consistency completes without errors, you can safely assume that the array is now healthy and the puncture is removed. Data can now be restored to the healthy array.

In more severe cases, the issue may not be resolved and errors can persist despite following these steps. If following these steps does not resolve an issue, contact Dell Technical Support for further assistance.

It may be necessary to analyze the punctures in more detail to determine which drives are in common. For example, in Figure 3, the controller log would show a puncture between Disks 0 and 1, and a puncture between Disks 0 and 2. Disk 0 is the common drive. Follow the same steps above, but completely remove the common drives first. So using the example in Figure 3, remove Disk 0, and then follow the steps outlined. Create the array using the remaining Disks (1 and 2). Once complete, and after a check consistency determines that the array is healthy, add Disk 0 back in and either perform the steps again with all the drives, or use the RLM (RAID level migration) and/or OCE (Online Capacity Expansion) features to add the remaining drives back into the array.
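Finding the common drive amounts to intersecting the drive sets of the logged punctures. A minimal Python sketch follows, assuming each puncture has already been read from the controller log as a tuple of disk numbers; the input format and function name are hypothetical.

```python
def common_drives(punctures: list[tuple[int, ...]]) -> list[int]:
    """Return the disks that take part in every logged puncture."""
    if not punctures:
        return []
    shared = set(punctures[0])
    for drives in punctures[1:]:
        shared &= set(drives)
    return sorted(shared)


# Figure 3 example: punctures between Disks 0 and 1, and between Disks 0 and 2.
figure3_common = common_drives([(0, 1), (0, 2)])
```

For the Figure 3 example the result contains only Disk 0. An empty result means that no single drive is shared by all punctures, in which case the analysis needs to be done by hand or with Dell Technical Support.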

Any drives that are marked predictive failure should be removed and not included in the recovery process. Again using Figure 3 as an example, if Disk 0 was predictive failure, remove this drive. Then perform the steps as outlined above. Since there are only two drives remaining, the RAID array created is a RAID 1 instead of a RAID 5. After obtaining a replacement Disk 0 (due to the predictive failure), perform the steps again, including all three drives, or add Disk 0 into the existing array using RLM and change it from a RAID 1 with 2 drives into a RAID 5 with 3 drives.

The process can be daunting, especially considering the potential for data loss. The adage, "An ounce of prevention is worth a pound of cure" is true here. Experience has shown that almost all double fault and puncture conditions could have been avoided by performing proactive maintenance on RAID hardware and arrays.

Note: Monitoring the system allows problems to be detected and corrected in a timely manner which also reduces the risk of more serious problems.

Affected Products

Servers
