Write Performance Raid 5

Question

I have read the performance calculation for CLARiiON FLARE 30:

Page 99: http://www.emc.com/collateral/hardware/white-papers/h5773-clariion-best-practices-performance-availability-wp.pdf

I'd like to get an idea of how to explain the write performance of a CLARiiON storage system.

Suppose we have a RAID 5 RAID group (4+1 disks).

A workload of, for example, 200 IOPS generates a load of 200*4 = 800 IOPS on the disks.

I thought about that, and my idea is as follows:

Every write puts 64 KB on one of the 5 disks. Another 3 disks have to read their data for the parity calculation, and the parity is written to the last disk. So for a 64 KB write, all 5 disks have to move their heads, either reading or writing.

If that is the case, I'd like to know why the write cache does not collect 4*64 KB = 256 KB and write it simultaneously to 4 disks, with the calculated parity going to the last disk. The 200 IOPS workload could then theoretically be (200*5)/4 = 250 IOPS on the disks, not 800 IOPS.

So can you please tell me whether my theory is right, and why the write cache cannot improve the performance?

How does this scale if we use a spanned MetaLUN of 20 disks: 4 x (4+1 RAID groups)?

Our performance tests confirm that sequential 64 KB writes to the 20 disks are as fast as writing to 5 disks (4+1).

The 20 disks only performed better on random I/O. How can that be explained? Does the write cache only work on random I/O?

Sridhar246 · Answer

Everything below is my understanding of the theory...

First, RAID 5 uses distributed parity, which means that for each LUN you bind on the RAID group the parity will sit on different disks and not on just 1 disk, as would be the case with RAID 3 (refer to page 50 of the document you linked above).

Second, IO is split evenly between all the drives in a RAID group, so there is no concept of the first 64k of data going to drive 1 and so on. If the SP is writing 64k of data to the RG, it is divided evenly between the 4 data drives (64/4 = 16k), with the parity going to the parity drive.

The way I understand how IO is written on backend for a Raid 5 RG is as follows:

1. SP reads the data from data drives

2. SP reads the parity from parity drive (again parity is distributed)

3. SP recalculates the parity based on the new information

4. SP writes data to data drives

5. SP writes the parity to the parity drive

Steps 1, 2, 4 and 5 above involve backend disk accesses, which is why you see a write penalty of 4 for R5. This is why, if you get 200 IOPS on the front end, your RAID group should be able to handle 200*4 = 800 IOPS.

When designing your backend, it's a good idea not to count on any performance improvement that the cache might bring, since ultimately the disks that hold the final data must be able to cope with the FE IO that the box is getting.

As for how write cache helps in write scenarios: in an R5 (4+1), 4 disks are data drives, which means a full stripe write is 4*64k = 256k, so the CLARiiON will, if possible, try to write IO to the backend disks in chunks of 256k.

Again, all of the above is my understanding (which can be wrong), so wait for comments from some of the experts.
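To make the arithmetic concrete, here is a small Python sketch of the two numbers discussed in this thread: the 4x small-write penalty, and the full-stripe case the question asks about. The function names and the default values are my own illustration, not anything from the EMC guide.

```python
def small_write_backend_iops(front_end_writes, write_penalty=4):
    """Backend disk IOs when every host write pays the RAID 5 penalty
    (read data, read parity, write data, write parity)."""
    return front_end_writes * write_penalty


def full_stripe_backend_iops(front_end_writes, data_disks=4):
    """Backend disk IOs if the cache could always gather full stripes.

    Assumes each host write is one stripe element, so data_disks host
    writes are flushed as data_disks data writes plus 1 parity write.
    """
    return front_end_writes * (data_disks + 1) / data_disks


print(small_write_backend_iops(200))   # 800
print(full_stripe_backend_iops(200))   # 250.0
```

The second figure is only reachable when the writes really do fill whole stripes, which is the sequential case discussed further down the thread.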

Storagesavvy · Answer

There are several questions here and some of this is covered in the best practices guides about the benefits of write cache and the differences between Random and Sequential IO…

Random IO – Since the location being updated is not sequential, the write cache caches the write and later performs the read from disk, the parity calculation, and the write back to disk. This accelerates random write IO since, without the cache, the host would have to wait for the 4 disk IOs to complete before moving on.

Since EMC storage systems store data in the Logical address that the host requested, there would be no way to coalesce random writes and stripe them to disk without the parity calculation, since each write could go to entirely different RAID stripes that could have varying stripe widths (64KB-960KB) and protection types (RAID1, RAID5, RAID6).

For Sequential IO – the array DOES fill up the cache stripe, calculate parity on the fly, and then write the data to disk once – assuming the workload is truly sequential and aligned with the stripe width. This function of array cache allows for accelerating sequential writes. Since Sequential writes don’t require reads to calculate parity, they can perform at about the same level as Sequential Reads up to the limit of the storage processor bandwidth and CPU available. Writes are mirrored between storage processors so there is extra CPU overhead and memory bandwidth required making writes more “expensive” than reads.

Writing data sequentially in increments of 64KB (or a multiple of 64KB) will net the best overall write bandwidth (MB/s) from a CLARiiON or VNX. However, if your application is OLTP or other random write IO, the IO size will typically be less than 64KB. Small-block random IO will provide the best overall throughput (IOPS) from the array. Throughput and bandwidth are inversely proportional.

For MetaLUNs, each component LUN has a stripe and then there is a stripe multiplier. So let’s look at your striped Meta of 20 disks (4 x R5 4+1) with a Stripe Multiplier of 4 (default and recommended for R5 4+1)…

4+1 yields a stripe width of 256KB (4 x 64KB)

Stripe Multiplier of 4 = MetaLUN stripe of 1MB (4 x 256KB)

So a sequential write of 1MB of data will end up as 4 stripes on the first 4+1 component. The next 1MB of data will end up on the next component, and so on. For a single-threaded sequential application, performance will be essentially equal to that of a single 4+1 RAID Group. If, however, you have multiple threads, each reading or writing sequential streams, you could achieve higher throughput. For the same reason, random small-block IO would invariably be hitting various components of the MetaLUN at the same time, achieving higher performance due to the higher number of physical disks being accessed simultaneously.
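A rough Python sketch of that layout may help. It assumes a simple round-robin mapping of 1MB segments across the 4 components, which is my simplification for illustration and not a description of the actual FLARE implementation. All names are hypothetical.

```python
KB = 1024
ELEMENT = 64 * KB                # stripe element per data disk
DATA_DISKS = 4                   # 4+1 RAID 5
STRIPE = ELEMENT * DATA_DISKS    # 256KB RAID group stripe
MULTIPLIER = 4                   # MetaLUN stripe multiplier
SEGMENT = STRIPE * MULTIPLIER    # 1MB written per component before moving on
COMPONENTS = 4                   # 4 x (4+1) RAID groups


def component_for_offset(offset):
    """Index of the component LUN that holds this byte offset
    (assumed round-robin over 1MB segments)."""
    return (offset // SEGMENT) % COMPONENTS


def components_touched(offsets):
    """Sorted list of components hit by a set of IO offsets."""
    return sorted({component_for_offset(o) for o in offsets})


# One thread writing the first 1MB sequentially in 64KB IOs
sequential = [i * ELEMENT for i in range(16)]
print(components_touched(sequential))   # [0]

# Four small random writes scattered across the address space
scattered = [0, SEGMENT, 2 * SEGMENT + 8 * KB, 3 * SEGMENT + 128 * KB]
print(components_touched(scattered))    # [0, 1, 2, 3]
```

The single sequential stream keeps only one 4+1 group busy at a time, while the scattered writes land on all four groups at once.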

The choice of MetaLUN or non-MetaLUN depends a lot on the workload expected, and the application/filesystem being used. For example, if you were deploying Oracle ASM, I would recommend creating multiple RGs, one LUN per RG, and then let ASM stripe across each LUN.

Hope that helps somewhat

Richard J Anderson

Storagesavvy · Answer

One correction to this…

For CLARiiON/VNX, 64KB does indeed go to a single spindle. If 64KB of data was written, and the adjacent 64KB blocks in the same stripe have not been modified, the array will then read the remaining 64KB blocks of the stripe from disk, read the parity, update the stripe with the newly written 64KB, recalculate parity, then write the new stripe and parity back to the disk. This process is why RAID5 has a 4X write penalty.

However, if all 64KB blocks of a stripe are modified (as in a Sequential write operation) then there is no need to read the old blocks from the disk, so the parity is calculated from the data already in cache and written to disk, avoiding the penalty.
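The difference between the two cases can be sketched with plain XOR parity in Python. This uses the textbook small-write update (new parity = old parity XOR old data XOR new data) on tiny 8-byte blocks; it is an illustration under those assumptions, not the array's actual code path, and the function names are made up.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks):
    """Full stripe write: all data is in cache, so no disk reads are needed."""
    return xor_blocks(*data_blocks)


def small_write_parity(old_data, new_data, old_parity):
    """Partial stripe write: old data and old parity must be read first."""
    return xor_blocks(old_parity, old_data, new_data)


# 4 data blocks standing in for the 4 x 64KB elements of a 4+1 stripe
stripe = [bytes([value] * 8) for value in (1, 2, 3, 4)]
parity = full_stripe_parity(stripe)

# Overwrite only the third block
new_block = bytes([9] * 8)
updated = stripe[:2] + [new_block] + stripe[3:]

# Both paths must arrive at the same parity
print(small_write_parity(stripe[2], new_block, parity) == full_stripe_parity(updated))  # True
```

Both paths produce the same parity. The small write has to fetch `old_data` and `old_parity` from disk before it can write, which is where the extra backend IOs come from.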

Richard J Anderson
