Unsolved

February 23rd, 2017 15:00

Probability of data loss in a RAID-1 multiple drive failure event

Let's say I'm worried about multiple drives failing after we power cycle an array. The disk pool is composed of RAID-1 pairs, and for the calculation let's say I have 600 drives, resulting in 300 pairs. If two fail in succession, there is a 1 in 599 chance (599 drives remain after the first failure) that the second failure hit the drive left unprotected by the first, and thus data loss. If a third drive fails before the first two mirrors are rebuilt via spares, then my chance of data loss on that third event is 2 in 598, as I have two drives unprotected at that point, and so on for additional failures until my rebuilds complete. So for the stats folks out there: how do I figure the total probability of data loss once N drives have failed before the rebuilds complete, basically the net total risk after the series of failures?
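One way to put numbers on this is sketched below. It rests on assumptions that are not in the question itself: every failure is equally likely to hit any surviving drive, failures are independent of which pair a drive belongs to, and no rebuild finishes in between. Under those assumptions, with 600 drives the second failure lands on the exposed partner with probability 1/599, the third with probability 2/598 given no loss so far, and so on. Multiplying the "no loss" factors gives the same answer as counting combinations: the failure set avoids every complete pair in C(pairs, k) * 2^k of the C(2 * pairs, k) possible ways. The function names are hypothetical, chosen for illustration.

```python
from fractions import Fraction
from math import comb


def p_data_loss(pairs: int, failures: int) -> Fraction:
    """Probability that `failures` failed drives include both members of
    at least one mirror pair, assuming every set of failed drives is
    equally likely (an assumption, not a vendor model)."""
    if failures > pairs:
        return Fraction(1)  # more failures than pairs: some pair is fully lost
    # "Safe" outcomes: choose `failures` distinct pairs, then one drive in each.
    safe = comb(pairs, failures) * 2 ** failures
    total = comb(2 * pairs, failures)
    return 1 - Fraction(safe, total)


def p_data_loss_sequential(pairs: int, failures: int) -> Fraction:
    """Same quantity, built up one failure at a time as in the question."""
    if failures > pairs:
        return Fraction(1)
    drives = 2 * pairs
    no_loss = Fraction(1)
    for i in range(failures):
        # i drives are already down, each in a different pair, so i surviving
        # partners are exposed among the (drives - i) drives still running.
        no_loss *= Fraction(drives - 2 * i, drives - i)
    return 1 - no_loss


if __name__ == "__main__":
    for k in (2, 3, 10, 20):
        print(k, float(p_data_loss(300, k)))
```

For 300 pairs this gives exactly 1/599 for two failures and 3/599 for three, and roughly 7% for 10 failures and 28% for 20. If the failures are correlated (for example, a bad batch that tends to sit on both sides of a mirror), the uniform assumption breaks down and the real risk could be higher.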

465 Posts

February 23rd, 2017 20:00

You can have a chat with your local Dell EMC account representative, and they will be able to provide you with an availability estimate for your (VMAX) configuration. They can calculate a disk pool availability figure, expressed as a number of nines, using the number of spares, drive rebuild times, etc.

For example, 600 x 600 GB 10K drives in RAID 1 come out in the very high six nines (almost seven nines), e.g. 99.999984% availability. Well and truly in the comfort zone :-)
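To make the "nines" figure concrete, here is a small sketch that converts an availability percentage into a count of nines and an equivalent unavailability per year. It assumes the percentage is read as a plain fraction of time, which may not be exactly how the vendor's model defines it; the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def nines(availability_percent: float) -> float:
    """Number of nines, e.g. 99.9 -> 3.0 (fractional values allowed)."""
    return -math.log10(1 - availability_percent / 100)


def unavailable_seconds_per_year(availability_percent: float) -> float:
    """Expected unavailable time per year, reading availability as a
    simple fraction of time (an assumption)."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


if __name__ == "__main__":
    print(nines(99.999984))
    print(unavailable_seconds_per_year(99.999984))
```

On that reading, 99.999984% works out to about 6.8 nines, or roughly 5 seconds of unavailability per year.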

9 Posts

February 24th, 2017 07:00

Yes, the usual EMC availability numbers are quite comforting. But I'm looking at a case where 10 or 20 drives could be not ready or failed out directly after a power-up, before rebuilds are complete, due to a combination of normal mortality, a known FCO issue on a disk type, or a firmware scan discovering errors and failing out drives.

Sorry for the long sentence above, but I'm just looking for the probability that, with X failures among N drives in a RAID-1 pool, both members of some mirrored pair are affected. Regardless of other availability numbers, I know this should be considered for analysis based on a situation that I am aware of (which I am not yet ready to discuss).