
207 Posts

1022

July 14th, 2008 14:00

Raid Group and file system can't keep up

Hello,

We are running a statistical application called SAS on AIX 5.3, connected to a cx380 with 8GB cache on each SP. There is a temporary/work file system called /saswork. Users kick off huge queries, and much of the sorting etc. is done in /saswork. Currently this file system sits on a 3+1 RAID 5. At times the file system does over 1200 IOPS; the drives cannot keep up, cache gets flooded, and I end up with way too many forced flushes. This seems to be impacting other applications that share the array.

If each of my 146GB 10K drives can do an average of 120 IOPS then I should have at least 10 drives to handle 1200 IOPS without flooding cache.

The file system is made up of 4 LUNs, but LUN16 gets most of the IOs. This is the first 100 GB LUN in the file system, and it seems like most jobs can get all their work done on this first LUN without needing the others. This is by far the busiest LUN on the array and it tends to bog down SPA. When I redo this I will use AIX Logical Volume striping so that IOs will be spread equally across all four LUNs - 2 on SPA and 2 on SPB.
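A rough model of why the first LUN takes most of the load today, and why striping should even it out. It assumes the volume is currently a plain concatenation that fills from the start, and it uses a made-up 1 MB strip size and a made-up 50 GB job; none of those details are stated in the post.

```python
from collections import Counter

LUN_COUNT = 4                 # four LUNs in the file system, as described above
LUN_SIZE_MB = 100 * 1024      # 100 GB per LUN
STRIP_SIZE_MB = 1             # assumed strip size, for illustration only


def concat_lun(offset_mb):
    """Concatenated volume: fill LUN 0 completely, then LUN 1, and so on."""
    return offset_mb // LUN_SIZE_MB


def striped_lun(offset_mb):
    """Striped volume: consecutive strips rotate across all LUNs."""
    return (offset_mb // STRIP_SIZE_MB) % LUN_COUNT


# Hypothetical job that touches only the first 50 GB of the volume.
touched = range(0, 50 * 1024)

concat_hits = Counter(concat_lun(mb) for mb in touched)
striped_hits = Counter(striped_lun(mb) for mb in touched)

print("concatenated:", dict(concat_hits))
print("striped:     ", dict(striped_hits))
```

In this model the concatenated layout sends every access to LUN 0, while the striped layout gives each of the four LUNs an equal share.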

I have 12 available 146GB drives (also have thirty 300GB drives but don't know if I want to use those big guys as I don't need much space). Should I do a 10+1 RAID 5? That seems like a lot of drives. Should I do two separate 5+1 RAID groups? If I get a few more drives maybe I could do two 6+1 or 7+1.

I have never used multiple RGs for one file system. Are there performance impacts? I could have LUN1 and LUN3 in the first RG, and LUN2 and LUN4 could be in the second. As heavy writes are occurring they would be going to all four LUNs in both RGs. Does this make sense, or should I stick to a single RG?

Thanks

Brad

6 Operator

 • 

5.7K Posts

July 15th, 2008 02:00

"If each of my 146GB 10K drives can do an average of 120 IOPS then I should have at least 10 drives to handle 1200 IOPS without flooding cache."


Correct. Bear in mind that in RAID5 a write I/O has a penalty of 4, so if your 1200 IOps are 75% read and 25% write, the disks will actually see 900 + (4 x 300) = 2100 IOps! At 120 IOps per drive, 2100 IOps means you'll need 17.5 (so 18) drives in RAID5 to handle this. But that's just an example, since I don't know the exact R/W ratio. You can enable statistics logging in Navisphere and look at each LUN, but you can also do some logging on the host to find out what the ratio is.
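The arithmetic above can be sketched in a few lines of Python. The function names are only illustrative, and the 120 IOps per drive default is the figure assumed in this thread, not a measured value.

```python
import math

# Back-end disk operations caused by one host write (small random writes).
WRITE_PENALTY = {"RAID5": 4, "RAID10": 2}


def backend_iops(host_iops, write_pct, raid):
    """Disk-level IOps for a host load in which write_pct percent are writes."""
    reads = host_iops * (100 - write_pct) / 100
    writes = host_iops * write_pct / 100
    return reads + WRITE_PENALTY[raid] * writes


def drives_needed(host_iops, write_pct, raid, iops_per_drive=120):
    """Minimum spindle count, before rounding to a valid RAID layout."""
    return math.ceil(backend_iops(host_iops, write_pct, raid) / iops_per_drive)


# The 75% read / 25% write example from this post.
print(backend_iops(1200, 25, "RAID5"), drives_needed(1200, 25, "RAID5"))
```

Swapping in the real read/write ratio from Navisphere or from host logging is just a matter of changing the `write_pct` argument.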

"I have 12 available 146GB drives (also have thirty 300GB drives but don't know if I want to use those big guys as I don't need much space). Should I do a 10+1 RAID 5? That seems like a lot of drives. Should I do two separate 5+1 RAID groups? If I get a few more drives maybe I could do two 6+1 or 7+1."


Performance-wise it's a best practice to do RAID5 (4+1), but you'll end up with 2 unused drives, and the ones you do use still cannot handle the amount of I/O. Have you considered RAID10 (6+6)? RAID10 has the advantage that the write penalty is only 2 instead of 4, so in my example of 75R/25W that would mean 900 + (2 x 300) = 1500 IOps. 1500 / 120 = 12.5 drives needed, so 14 actually, since an odd number like 13 isn't allowed in RAID10.
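The rounding step can be sketched as below. These helper names are made up for illustration, and the RAID5 helper assumes the load is spread over whole 4+1 groups, which is only one possible layout.

```python
import math


def raid10_drives(min_drives):
    """Round up to an even count, since RAID10 is built from mirrored pairs."""
    whole = math.ceil(min_drives)
    return whole + (whole % 2)


def raid5_drives(min_drives, group_size=5):
    """Round up to whole RAID5 groups (group_size=5 means 4+1)."""
    return math.ceil(min_drives / group_size) * group_size


print(raid10_drives(1500 / 120))   # the 12.5 drives from the RAID10 example
print(raid5_drives(2100 / 120))    # the 17.5 drives from the RAID5 example
```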

"I have never used multiple RGs for one file system. Are there performance impacts? I could have LUN1 and LUN3 in the first RG, and LUN2 and LUN4 could be in the second. As heavy writes are occurring they would be going to all four LUNs in both RGs. Does this make sense, or should I stick to a single RG?"


Another way would be to create striped metaLUNs, with each meta having a component on each RG. This way the cumulative performance of all RGs will benefit all LUNs. So in your case, with for example 2 RGs: create LUNs of half the size of LUNs 1 to 4 on RG1 and the same on RG2, and then create striped metas, each with one component on RG1 and the other on RG2. I would "interleave" the primary components across the RGs. I mean: pri LUN1 is on RG1, pri LUN2 is on RG2, pri LUN3 is on RG1 and pri LUN4 is on RG2.
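To make the interleaving concrete, here is a small planning sketch. It only prints a layout; it does not talk to the array, and the function name, the LUN and RG labels and the 100 GB size are placeholders.

```python
def meta_layout(lun_names, raid_groups, lun_size_gb):
    """Plan striped metaLUNs with one equal component per RAID group.

    The first entry for each LUN is its primary component; the primary
    rotates across the RAID groups from one LUN to the next.
    """
    component_gb = lun_size_gb / len(raid_groups)
    plan = {}
    for i, name in enumerate(lun_names):
        shift = i % len(raid_groups)
        order = raid_groups[shift:] + raid_groups[:shift]
        plan[name] = [(rg, component_gb) for rg in order]
    return plan


plan = meta_layout(["LUN1", "LUN2", "LUN3", "LUN4"], ["RG1", "RG2"], 100)
for lun, components in plan.items():
    print(lun, components)
```

With two RGs this puts the primaries of LUN1 and LUN3 on RG1 and the primaries of LUN2 and LUN4 on RG2, matching the description above.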

6 Operator

 • 

5.7K Posts

July 15th, 2008 02:00

"Should I do two separate 5+1 RAID groups?"


Another advantage of having multiple RGs is that you can survive 2 simultaneous disk failures, as long as they are in different RGs, and a single disk failure will only have an impact on the RG it is in. Besides: rebuilding a RAID5 (4+1) is faster than rebuilding a RAID5 (8+1).

A RAID10 rebuild however is even faster, since rebuilding means copying from the mirror and not really reconstructing the data from XORs. And in RAID10 (6+6) you can survive up to 6 disk failures, as long as no two of them are in the same RAID1 pair (RAID10 is a striped set of multiple RAID1s).
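The RAID10 failure rule can be checked with a short sketch. The disk numbering and the way disks are paired here are hypothetical; the array decides the real pairing.

```python
def raid10_survives(failed_disks, pairs):
    """True if no mirrored pair has lost both of its members."""
    failed = set(failed_disks)
    return all(not (a in failed and b in failed) for a, b in pairs)


# Hypothetical 6+6 group: disk i is mirrored to disk i + 6.
pairs = [(i, i + 6) for i in range(6)]

print(raid10_survives([0, 1, 2, 3, 4, 5], pairs))  # one disk lost per pair
print(raid10_survives([0, 6], pairs))              # both halves of one pair
```

Six failures spread one per pair leave the group running, while just two failures in the same pair lose data.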

11 Legend

 • 

20.4K Posts

 • 

87.4K Points

July 15th, 2008 04:00

Rob ..so if Brad wants to do host-based striping ..wouldn't he end up with an ill-advised configuration where he is using RAID 5, creates a striped metaLUN and then stripes at the host?

6 Operator

 • 

5.7K Posts

July 15th, 2008 04:00

True. Ahhh.... I see. So I'd advise:
RAID5 LUNs, half the size of what's needed per LUN per RG; stripe each volume over 2 LUNs on the host

If however the RAID10 (6+6) can do the job, I'd prefer that one!

207 Posts

July 15th, 2008 10:00

A little more info - Because this is a "work" file system it is a little strange. It looks like 55% writes and 45% reads. I'm guessing that an initial step writes stuff out to /saswork, where it is sorted and manipulated for use later. Each job usually cleans up/deletes its files at the end.

The RAID 5 must be encountering a significant penalty. Maybe I'm better off with the RAID 10 6+6. It just seems like it will only use the power of 6 spindles and then mirror that to another six. Am I really getting the performance of 12 spindles or just 6?

I may be able to get 4 more drives later. Can I simply add/expand the 4 drives to my 6+6 without any problems? This would give me an 8+8.

Thanks - Brad

207 Posts

July 15th, 2008 10:00

Also, once I have my 8+8 - does it make sense to add a second 8+8 RG using some of the 300GB drives if Analyzer shows that I still need more drives? Maybe I would bind new LUNs in the 300GB drive RG and then do a striped meta expansion of the original 8+8 LUNs.

It seems like I may need all these drives because of the amount of writes.

6 Operator

 • 

5.7K Posts

July 16th, 2008 00:00

These 1200 IOps you are talking about, where did you measure them? On the host or on the array? If you did it on the host and you have 55% writes and 45% reads, the outcome would be:
RAID5: (1200 x 45%) + 4 x (1200 x 55%) = 540 + (4 x 660) = 540 + 2640 = 3180 IOps
RAID10: (1200 x 45%) + 2 x (1200 x 55%) = 540 + (2 x 660) = 540 + 1320 = 1860 IOps

Calculating with 120 IOps per 10k drive (130 or 140 are also mentioned, but let's stay on the safe side with 120) that would mean you need:
RAID5: 3180 / 120 = 26.5 drives --> round this up to the next integer which matches your RAID5 layout (4+1, 8+1, whatever)
RAID10: 1860 / 120 = 15.5 drives --> 16 (8+8)

If you had 15k drives at 170 IOps:
RAID5: 3180 / 170 = 18.7 drives --> round this up to the next integer which matches your RAID5 layout, for example 4 x (4+1) = 20 spindles
RAID10: 1860 / 170 = 10.9 drives --> 12 (6+6)
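The same figures can be turned around to estimate how much host load a given RAID10 group could carry. This is only a sketch under the assumptions used above (120 IOps per 10k drive, write penalty of 2, 55% writes measured on the host); the function name is illustrative.

```python
def max_host_iops(drives, write_pct, penalty, iops_per_drive=120):
    """Host IOps a RAID group can sustain at its rated per-drive IOps."""
    backend_capacity = drives * iops_per_drive
    # Disk operations generated by 100 host I/Os at this read/write mix.
    cost_per_100_host_ios = (100 - write_pct) + penalty * write_pct
    return backend_capacity * 100 / cost_per_100_host_ios


for drives in (12, 16):
    print(drives, "drives:", int(max_host_iops(drives, 55, 2)), "host IOps")
```

Under these assumptions a 6+6 falls short of 1200 host IOps while an 8+8 just clears it, which agrees with the 16 drives calculated above.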


If you start with a 6+6 now and want to expand the RG with extra drives, that's possible. The existing LUNs will be "spread" out over all drives and you'll benefit from the extra performance as soon as the RG expansion is done.
But please don't think you can use the extra space and keep the performance, since you would then need to expand a LUN (create a meta), and where would the extra space come from? The same RG? Best practice for a meta is to get the components from different RGs.

6 Operator

 • 

5.7K Posts

July 16th, 2008 05:00

I hope it helps.
There was a similar thread a few months ago as well.

The difficult part is to measure the number of IOps the host is generating over a certain amount of time. The outcome of that is your starting point. If the measurement is wrong, you'll end up buying too many or too few disks.... which in a way is good for EMC and Stefano's new car... ;)

11 Legend

 • 

20.4K Posts

 • 

87.4K Points

July 16th, 2008 05:00

Rob ..thank you for showing really detailed calculations ..very helpful !!

207 Posts

July 16th, 2008 08:00

I've been measuring IOPS and KB per second at the array with Navisphere Analyzer. Looking at just IOPS, reads are a little higher than writes. But looking at KB per second, writes are higher. It seems to be doing larger block writes than reads.
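The average I/O size falls out of those two counters directly. A minimal sketch follows; the sample numbers are invented purely to show the calculation and are not the values from this array.

```python
def avg_io_kb(kb_per_sec, iops):
    """Average I/O size in KB, from throughput and I/O rate."""
    return kb_per_sec / iops if iops else 0.0


# Invented sample values, NOT measurements from this thread.
read_kb = avg_io_kb(kb_per_sec=5200, iops=650)
write_kb = avg_io_kb(kb_per_sec=35200, iops=550)

print("avg read size  (KB):", read_kb)
print("avg write size (KB):", write_kb)
```

A result like this, with fewer but much larger writes, would match the pattern of read IOPS being higher while write KB per second is higher.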

One thing that I have learned with this issue is that it is better to do the striped meta LUNs to a different RAID group. I didn't really know that before and I'm not sure why a second RG is better. That said I will likely create smaller LUNs on my new RAID 1/0.

If I need more spindles or space later I will create similar smaller LUNs in a new RAID 1/0. I will then do a striped meta expansion of the original smaller LUNs with the new RG LUNs. Each individual LUN would then have the original component on the first RG and the expanded component on the second RG. It sounds like this would bring more spindles into play for each LUN.

Sound good?

6 Operator

 • 

5.7K Posts

July 18th, 2008 02:00

Yup. More spindles means more IOps capability per LUN. So stripe them over as many disks as possible.

If you measured at the array, you will see the actual IOps to disk, so that's the value to calculate with. Each host I/O is translated into Clariion I/Os and the IOps per disk is a known value, so gentlepeople, start your math!

6 Operator

 • 

4.5K Posts

July 18th, 2008 08:00

"One thing that I have learned with this issue is that it is better to do the striped meta LUNs to a different RAID group. I didn't really know that before and I'm not sure why a second RG is better."

The reason has to do with striping the metaLUN within the same RAID group: you get vertically striped patterns on the physical disks, so when you write to the meta, it must access the same disks over and over to write the data. If you use different RAID groups, you get the benefit of overlapped seeks and writing to more physical disks.

glen

207 Posts

July 18th, 2008 14:00

Well, one more thing just came up as I started to create my new RG with the 12 available 146GB drives. It looks like 4 of them might be vault drives - 000, 001 etc.

Hmmm, I'd rather not use those drives for this busy RG. I do have 30 300GB drives available. Earlier I had preferred the 146GB drives as I did not need all the space on the 300GB drives. If I use 12 300GB drives I will be burning over 3TB of disk to get my 400GB file system. Now, because of the vault drive issue, I may need to use the 300GB drives.
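The space trade-off can be put in numbers with a quick sketch. It uses nominal drive sizes, so the real formatted capacity will be somewhat lower, and the function name is only illustrative.

```python
def raid10_capacity_gb(drives, drive_gb):
    """Return (raw, usable) nominal capacity; mirroring halves the raw space."""
    raw = drives * drive_gb
    usable = raw // 2
    return raw, usable


NEEDED_GB = 400   # size of the /saswork file system mentioned above

raw, usable = raid10_capacity_gb(12, 300)
unused = usable - NEEDED_GB
print("raw:", raw, "GB  usable:", usable, "GB  left over:", unused, "GB")
```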

This gives me some flexibility because now I'm not limited to only 12 drives. I could create two 6+6 RGs and stripe expand the luns in the first RG to the luns in the second RG using the 300GB drives.

I believe the 146GB and 300GB drives are both 10K. Will the response time be the same for either drive size? Or will the 300GB drives be slower for some reason? If I made one RG with 146GB drives and stripe expanded to LUNs in a 300GB RG - would that matter?

Thanks again.

2.2K Posts

July 18th, 2008 15:00

Brad,
The performance of 10k drives will be roughly the same regardless of the capacity. And I agree that you should not use the vault drives for the RG, since you have high performance requirements and you don't want to impact the performance of the array itself as well as your application. Beyond that, since the space available on each of the vault drives is reduced by the space allocated to vault data, the usable capacity of each drive in the RG will be limited as well.

I don't see a problem with taking LUNs from a 146GB 10k RG and striping them with LUNs from a 300GB 10k RG, since the performance metrics will be the same. As you pointed out, there will be a lot of wasted space (assuming you don't want any other LUNs for other hosts in these RGs), but if that is acceptable to provide the performance configuration for your application, then that is the cost, right?