51 Posts
Consistency Check sometimes finds inconsistencies - Raid 1+0
We regularly find that a Raid Consistency Check on our new T420 server array finds a small number of inconsistencies. This is a 4-disk Raid 1+0 array. Any idea what might be causing this?
The Perc 310 Raid controller card and cables have been replaced, but we are still using the original backplane and three of the original disks. One of the disks is new because it was replaced when it failed suddenly without warning.
It has been suggested that this is caused by 'corrupted parity in the Raid header', but I don't understand how that can be when there is no parity involved in a Raid 1+0 array!
Have exported and checked the LSI log and note errors:
I2C 1 cannot find idle bus!
Spurious interrupt received in task mode
Any suggestions what is causing this?
DELL-Daniel My
Moderator
6.2K Posts
0
May 18th, 2014 17:00
Hello
Inconsistencies in the array are usually caused by bad blocks. These can be bad physical or logical blocks. Blocks fail on HDDs all the time, so it is not uncommon for there to be inconsistencies. If you are running consistency checks on a regular basis and it is always repairing issues then there might be a problem.
If this started occurring around the time of that drive failure then I would suspect the array may be corrupted.
I'm not sure what that means. The "RAID header" would likely be referring to the metadata tags that describe the array. The metadata tags contain information like the number of disks in the array, stripe size, and overall array size. If this information was corrupted then the data would not likely be accessible.
What could be corrupted are the logical blocks. If the drive failed due to a head crash then it is possible that corrupted logical blocks would be copied around to the other drives before it went offline. This is similar to file system corruption for a partition. You would need to delete the array, create a dissimilar array (anything other than RAID 10), initialize it, and then once it finishes initializing delete it and recreate your RAID 10. This will get the controller to rewrite its file system.
The other two possibilities I can think of are physical block failures on the HDDs, and a communication problem causing data to be corrupted. I think it is unlikely that it would be physical block failures because you should be getting errors if that is occurring. The SMART on the HDD should notify you if there are bad blocks occurring. Like I said earlier, blocks fail on HDDs all the time. The SMART only notifies you if a certain threshold of bad blocks is exceeded, so it is possible that you have bad blocks and are not receiving errors on them.
The other issue of communication could be cabling or backplane. You should also make sure the firmware throughout the system is up to date, and the PERC driver is at a compatible revision to the PERC firmware.
If you are receiving block addresses for the repairs then that would be helpful. If the drive and location of the block are listed then we could narrow the problem down to a single drive. If the blocks are sequential then that would be a sign of a head crash.
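As a rough illustration of what to look for in the repair addresses, here is a minimal Python sketch that groups reported block addresses into clusters of nearby blocks. The function name `cluster_lbas`, the `max_gap` value, and the sample addresses are all made up for the example; they are not taken from any PERC log or tool.

```python
def cluster_lbas(lbas, max_gap=2048):
    """Group block addresses into clusters of neighbours.

    Two addresses land in the same cluster when they are at most
    max_gap blocks apart. max_gap is an arbitrary illustrative value.
    """
    clusters = []
    for lba in sorted(set(lbas)):
        if clusters and lba - clusters[-1][-1] <= max_gap:
            clusters[-1].append(lba)   # close to the previous address
        else:
            clusters.append([lba])     # start a new cluster
    return clusters


# Hypothetical addresses, invented for illustration only.
example = [1000, 1008, 1016, 500000, 500004, 9000000]
for cluster in cluster_lbas(example):
    print(f"{len(cluster)} block(s) from {cluster[0]} to {cluster[-1]}")
```

Many repairs falling into one or two tight clusters would point toward localized damage, while addresses scattered evenly across the disk would fit a communication problem better.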
Those can be ignored. They are erroneous messages that commonly occur in the logs.
Thanks
NickC_UK
51 Posts
0
May 20th, 2014 07:00
Hi Dell-Daniel,
Thanks for your detailed reply.
After having the Raid Controller and connecting cables replaced we have Consistency Checks of:
Surely the fact that check 2. got zero inconsistencies indicates that there were no logical block errors, doesn't it? If so, where has the one inconsistency found in check 3. come from? Surely it is no coincidence that the address of the inconsistency is remarkably close to the previous inconsistencies found in consistency check 1.
We think the consistency checks were failing before the drive was replaced.
"I'm not sure what that means. The "RAID header" would likely be referring to the metadata tags that describe the array. The metadata tags contain information like the number of disks in the array, stripe size, and overall array size. If this information was corrupted then the data would not likely be accessible."
Quite, so what are Dell support talking about when they tell us that there is 'corrupted parity in the Raid header'?
Doesn't the fact that we have had a clean consistency check indicate that there are no corrupted logical blocks present?
Controller Card and connecting cables already replaced, Dell would not replace the backplane. Might be worth us trying again to try and get that backplane replaced.
"Those can be ignored. They are erroneous messages that commonly occur in the logs."
Are you absolutely sure about that? We had been told by Dell support that they are 'Related to bad parity/puncturing'!
Thanks,
DELL-Daniel My
Moderator
6.2K Posts
0
May 24th, 2014 10:00
It looks like logical block issues caused by a drive failure. When logical block addresses are that closely matched it is usually indicative of a head crash. I think you are getting something confused with how the LBAs work. An LBA does not span the entire RAID array.
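To make the addressing point concrete, here is a minimal Python sketch of how a virtual disk block address could map onto one mirrored pair in a 4-disk RAID 10. The layout is an assumption for illustration: two mirrored pairs (disks 0+1 and 2+3) striped together with a 128-block stripe element. The function `map_virtual_lba` is hypothetical and the real PERC layout may differ.

```python
def map_virtual_lba(vlba, strip_blocks=128, pairs=2):
    """Map a virtual disk LBA to (mirror pair disks, LBA on each disk).

    Assumed layout (illustrative only): stripe elements of strip_blocks
    blocks alternate across `pairs` mirrored pairs; pair p is made of
    physical disks 2*p and 2*p + 1.
    """
    strip_index, offset = divmod(vlba, strip_blocks)
    row, pair = divmod(strip_index, pairs)
    physical_lba = row * strip_blocks + offset
    disks = (2 * pair, 2 * pair + 1)
    return disks, physical_lba


# Under this assumed layout, each address lives on exactly one pair.
for vlba in (0, 128, 300):
    disks, plba = map_virtual_lba(vlba)
    print(f"virtual LBA {vlba} -> disks {disks}, physical LBA {plba}")
```

Under these assumptions, every address belongs to a single mirrored pair, so addresses that are close together on the virtual disk fall either in the same stripe element on one pair or in neighbouring elements.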
They may have used terminology improperly, you may be misquoting them, or they may be talking about something neither of us is familiar with. I don't know. The term "RAID Header" is commonly used to describe the metadata tags. The reference to parity in the RAID header is not something I'm familiar with.
I have no idea what you mean by this statement. It is my understanding that the consistency checks are not coming back clean. Isn't that the whole point of this conversation? When the controller runs a consistency check it goes through and verifies the data on the array. If it finds errors it uses the redundancy from the other drives in the array to correct the errors.
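A toy Python model of that check on a single mirrored pair may help. It compares the two copies block by block and repairs mismatches. It is only a sketch: `consistency_check` is a made-up function, and treating the first copy as authoritative is an assumption, not a statement of how the PERC firmware decides which copy to keep.

```python
def consistency_check(primary, mirror, repair=True):
    """Compare two mirror copies block by block.

    Returns the indexes of blocks that differ. When repair is True the
    mirror block is overwritten from the primary copy (assumption: the
    primary copy is treated as the good one).
    """
    mismatches = []
    for index, (a, b) in enumerate(zip(primary, mirror)):
        if a != b:
            mismatches.append(index)
            if repair:
                mirror[index] = a
    return mismatches


# Hypothetical block contents, invented for illustration.
primary = [b"A", b"B", b"C", b"D"]
mirror = [b"A", b"B", b"X", b"D"]

first = consistency_check(primary, mirror)   # finds and repairs block 2
second = consistency_check(primary, mirror)  # nothing left to repair
print(first, second)
```

In this model a second pass straight after a repair finds nothing, so a new mismatch on a later pass means one of the copies changed in between.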
Yes, I'm sure. That message is related to communication with the iDRAC. It has nothing to do with your issue. If the message occurs occasionally then it is normal. If your logs are flooded with it then there is likely a problem.
Thanks
NickC_UK
51 Posts
0
May 27th, 2014 14:00
"I have no idea what you mean by this statement. It is my understanding that the consistency checks are not coming back clean. Isn't that the whole point of this conversation? When the controller runs a consistency check it goes through and verifies the data on the array. If it finds errors it uses the redundancy from the other drives in the array to correct the errors."
Some Consistency checks are coming back clean but some are not, see timeline below:
So the fact that we had two clean consistency checks suggests to me that at that time the array was fully consistent. If so, what is causing the inconsistencies to return later? A hardware fault? That can only be the backplane or one of the disks, but none of the disks are reporting any hardware errors.
DELL-Daniel My
Moderator
6.2K Posts
0
May 28th, 2014 10:00
This is likely a logical issue. You should delete the array, create a dissimilar array, initialize it, and then after initialization completes delete it and create your desired array. All of the block addresses are close together. This happens when a head crash occurs. The head will skip across the platters and cause damage to multiple areas. The likely reason that the check failed after passing is that between the checks there was an attempt to write to the block.
pcmeiners
4 Operator
1.8K Posts
0
May 28th, 2014 14:00
"So what happens if we write some software which cycles through each partition on the array and writes something to all unused sectors"
That is where Patrol Read comes in. It checks EVERY sector of the array drives, not only sectors containing data. P.S. Partitions are at the OS level, not at the RAID level.
http://www.dell.com/downloads/global/power/ps1q06-20050212-Habas.pdf
"Simpler than that, what about if we just take each disk offline individually and after reattaching, rebuild the array"
Not practical or safe, as each rebuild carries an element of danger. Should another disk in the array develop an issue during any of the rebuilds, the array fails.
Also, testing would need to be done on a similar controller, not a standard HBA, as standard disk controllers do not sufficiently test disks that will be placed on a RAID controller.
NickC_UK
51 Posts
0
May 28th, 2014 14:00
So what happens if we write some software which cycles through each partition on the array and writes something to all unused sectors, then run a Consistency Check which will fix any remaining logical errors?
Edit:
Simpler than that, what about if we just take each disk offline individually and, after reattaching it, rebuild the array? After that has been done to all four disks there cannot be any logical errors left, can there?