February 8th, 2012 11:00

Data loss comparison - RAID5 vs RAID10 vs RAID6

Hello,

Does anyone have any information on estimating the chance of data loss due to RAID group failure for each RAID type?  I'm putting together a detailed case for converting from heavy RAID10 usage to RAID5/6.  The IO and performance considerations are already collected, but I still need to address the 'lack of redundancy' objection that the business I support raised after a quick Google search, especially their concern that such a failure could take out an entire VP pool.

For example:

25 x 3+1 R5 groups of 300GB FC drives in a virtual pool have a 0.005% chance of a double disk failure within the same RAID group, vs. a 0.0008% chance for a virtual pool consisting of 50 x 2-way mirrors of 300GB FC drives.
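To show the kind of back-of-the-envelope estimate I'm after, here is a rough sketch assuming independent drive failures, a 200K-hour MTBF and an 8-hour rebuild window (all placeholder assumptions on my part, not vendor figures):

```python
# Rough sketch only: chance that a second drive in the same group fails
# while the first is still rebuilding.  MTBF and rebuild hours are
# placeholder assumptions, not measured or vendor-supplied values.

MTBF_HOURS = 200_000      # assumed per-drive mean time between failures
REBUILD_HOURS = 8         # assumed rebuild window for a 300GB FC drive
HOURS_PER_YEAR = 8760

def group_annual_loss_prob(drives_in_group, drives_still_at_risk):
    """P(data loss in one RAID group over a year), small-probability approximation."""
    p_first = drives_in_group * HOURS_PER_YEAR / MTBF_HOURS       # some drive fails this year
    p_second = drives_still_at_risk * REBUILD_HOURS / MTBF_HOURS  # another fails before rebuild ends
    return p_first * p_second

def pool_annual_loss_prob(groups, p_group):
    """P(at least one group in the pool loses data in a year)."""
    return 1 - (1 - p_group) ** groups

r5 = group_annual_loss_prob(4, 3)   # 3+1 RAID 5: three survivors exposed during rebuild
r1 = group_annual_loss_prob(2, 1)   # 2-way mirror: one survivor exposed
print(f"25 x 3+1 R5 pool : {pool_annual_loss_prob(25, r5):.4%}")
print(f"50 x mirror pool : {pool_annual_loss_prob(50, r1):.4%}")
```

The point is just that the per-group chance of a double failure gets multiplied across every group in the pool.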

I've read many articles and even used the MTTDLcalc spreadsheet floating around the Internet (how do you set the RAID group size?  Is the percentage predicting a single or double disk failure?  If I specify 300 drives in my array it predicts a 50% chance of data loss in the first year, which doesn't seem correct), but I can't seem to find a definitive answer.  Has anyone else dealt with this?

Thanks!

Scott

1.3K Posts

February 9th, 2012 14:00

I just uploaded a simple spreadsheet I did in a statistics class to ftp.emc.com

It is in the /incoming/jadams/RAID directory.  You can change the variables if you want.

1.3K Posts

February 8th, 2012 16:00

First off, the chance of data loss from a dual drive failure on RAID6 is ZERO.  You must lose three drives in the RAID group before you have lost data.

The calculations for RAID1 or RAID5 are fairly simple; the RAID6 calculations, on the other hand, are very complex.  I think for any system equipped with spares, the chance of data loss due to drive failures with RAID6 should be considered zero.
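For what it's worth, the textbook MTTDL approximations look roughly like this (sketch only, assuming independent exponential failures and a constant rebuild time; the RAID6 line is just the leading-order term, which is why I treat it as effectively zero in practice):

```python
# Textbook MTTDL approximations (sketch).  The MTBF/MTTR values below are
# assumed, not measured: plug in your own figures.

def mttdl_raid1(mtbf, mttr):
    # mirrored pair: the second copy must fail inside the rebuild window
    return mtbf**2 / (2 * mttr)

def mttdl_raid5(n_drives, mtbf, mttr):
    # any drive fails, then any of the n-1 survivors fails during rebuild
    return mtbf**2 / (n_drives * (n_drives - 1) * mttr)

def mttdl_raid6(n_drives, mtbf, mttr):
    # three overlapping failures are needed, hence the extra MTBF/MTTR factor
    return mtbf**3 / (n_drives * (n_drives - 1) * (n_drives - 2) * mttr**2)

MTBF, MTTR = 200_000, 8   # hours, assumed
print(f"RAID1 pair : {mttdl_raid1(MTBF, MTTR):,.0f} hours")
print(f"RAID5 3+1  : {mttdl_raid5(4, MTBF, MTTR):,.0f} hours")
print(f"RAID5 7+1  : {mttdl_raid5(8, MTBF, MTTR):,.0f} hours")
print(f"RAID6 6+2  : {mttdl_raid6(8, MTBF, MTTR):,.0f} hours")
```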

465 Posts

February 8th, 2012 16:00

Hi Scott,

RAID 5 7+1 has the highest probability of a data loss (DL) due to multiple drive failure.

In very approximate terms, with the 300GB drive example and a mean time between failure (MTBF) of 200K hours, here is how the other RAID types stack up over time against RAID 5 7+1... On AVERAGE:

RAID 5 3+1 will take twice as long before a DL as 7+1;

RAID 1 will last 6 times longer;

RAID 6(14) will last 400 times longer;

RAID 6(6) will last around 2500 times longer before a DL.

Numerous assumptions are made to determine these figures so please consider this as a rough guide.

The table below indicates the relative reliability between RAID types. The numbers above are from the first data column in the table.


           7RAID5      3RAID5      RAID1       14RAID6      6RAID6
7RAID5     1           0.464286    0.1625      0.00236364   0.000394
3RAID5     2.153846    1           0.35        0.00509091   0.000848
RAID1      6.153846    2.857143    1           0.01454545   0.002424
14RAID6    423.0769    196.4286    68.75       1            0.166667
6RAID6     2538.462    1178.571    412.5       6            1
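The exact assumptions behind the table aren't shown here, but the general shape of the comparison can be sketched by normalising the textbook MTTDL figures to the same usable capacity. The rebuild times and MTBF below are placeholders, so the ratios will not match the table exactly:

```python
# Sketch only: relative MTTDL per unit of usable capacity.  All inputs are
# assumptions, so these ratios illustrate the shape of the table rather than
# reproducing its exact numbers.

MTBF = 200_000   # hours, assumed

def group_mttdl(data, parity, mttr):
    n = data + parity
    if parity == 1:    # RAID5, or a 2-way mirror treated as 1 data + 1 copy
        return MTBF**2 / (n * (n - 1) * mttr)
    if parity == 2:    # RAID6, leading-order approximation
        return MTBF**3 / (n * (n - 1) * (n - 2) * mttr**2)
    raise ValueError("unsupported layout")

# (data drives, parity drives, assumed rebuild hours); wider groups rebuild more slowly
layouts = {
    "7RAID5":  (7, 1, 12),
    "3RAID5":  (3, 1, 8),
    "RAID1":   (1, 1, 6),
    "14RAID6": (14, 2, 12),
    "6RAID6":  (6, 2, 8),
}

reference = None
for name, (d, p, mttr) in layouts.items():
    # multiply by data drives so every layout is compared at the same usable capacity
    score = group_mttdl(d, p, mttr) * d
    reference = reference or score           # first entry (7RAID5) is the baseline
    print(f"{name:<8} {score / reference:10.1f}x vs 7RAID5")
```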

19 Posts

February 9th, 2012 07:00

Quincy56 wrote:

First off, the chance of data loss from a dual drive failure on RAID6 is ZERO.  You must lose three drives in the RAID group before you have lost data.

The calculations for RAID1 or RAID5 are fairly simple; the RAID6 calculations, on the other hand, are very complex.  I think for any system equipped with spares, the chance of data loss due to drive failures with RAID6 should be considered zero.

Correct, that is why I used R5 groups in my example.  Can you provide the calculations for R1 and R5?  If R6 is overly complex, those should be fine for me to start with.  Thanks!

19 Posts

February 9th, 2012 07:00

Jasonc wrote:

Hi Scott,

RAID 5 7+1 has the highest probability of a data loss (DL) due to multiple drive failure.

In very approximate terms, with the 300GB drive example and a mean time between failure (MTBF) of 200K hours, here is how the other RAID types stack up over time against RAID 5 7+1... On AVERAGE:

RAID 5 3+1 will take twice as long before a DL as 7+1;

RAID 1 will last 6 times longer;

RAID 6(14) will last 400 times longer;

RAID 6(6) will last around 2500 times longer before a DL.

Thanks Jason.  So based on this information, what is the estimated time to failure for a RAID 5 7+1 group?  If it is something like a 1% chance over 1 year, then I can assume 3+1 would be half that, etc.  I know there might not be an answer for this, and that even providing one would probably require a complex calculation involving frame type, number of RAID groups, etc.
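Something like the sketch below is what I had in mind for turning an MTTDL figure into a yearly percentage (constant failure rate assumed, and the 60-million-hour group MTTDL is just a made-up input):

```python
# Sketch: convert an MTTDL figure into "chance of a data loss event within a year",
# assuming a constant failure rate.  The MTTDL plugged in below is illustrative only.
import math

HOURS_PER_YEAR = 8760

def annual_loss_probability(group_mttdl_hours, groups=1):
    """P(at least one data-loss event in a year) across `groups` identical RAID groups."""
    pool_mttdl = group_mttdl_hours / groups   # independent groups fail `groups` times as often
    return 1 - math.exp(-HOURS_PER_YEAR / pool_mttdl)

print(f"one 7+1 group  : {annual_loss_probability(60_000_000):.4%}")
print(f"25 such groups : {annual_loss_probability(60_000_000, groups=25):.4%}")
```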

465 Posts

February 9th, 2012 13:00

To calculate the average life expectancy of an R5 7+1 group before failure, you need to make assumptions about MTBF and drive rebuild times. Putting many 7+1 groups into a pool will likely reduce the average life expectancy. Refer to the recent discussion on SPEED relating to this. The latest version of Symmerge has a reliability calculator; you will find it in the 'create target wizard' dialog. This might be a better option for you than the manual method you are using, since you can model an array and see the impact on reliability as you change the RAID type configuration.

19 Posts

February 10th, 2012 08:00

Quincy56 wrote:

I just uploaded a simple spreadsheet I did in a statistics class to ftp.emc.com

It is in the /incoming/jadams/RAID directory.  You can change the variables if you want.

This is great.  I just added some fixed fields that I can update for drive count, number of years, etc.  Thanks! 
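For anyone following along, the scaling behind the fields I added is roughly this (sketch only, independence assumed; the per-unit probability is hypothetical):

```python
# Sketch of the scaling I added to the spreadsheet: extend a per-unit, per-year
# loss probability to any number of units (drives or groups) and years.

def loss_probability(p_per_unit_year, units, years):
    """P(at least one loss event across all units over the whole period)."""
    return 1 - (1 - p_per_unit_year) ** (units * years)

# hypothetical 0.002% chance per unit per year, 100 units, 5 years
print(f"{loss_probability(0.00002, units=100, years=5):.4%}")
```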

As a side note, is there any way to quantify the value added by the additional technologies EMC uses, like proactive sparing and error checking, over and above the estimated (and often misleading) MTBF the drive vendor publishes based on its own error prevention processes?  I don't think it is likely, just throwing it out there.
