207 Posts
0
1022
July 14th, 2008 14:00
Raid Group and file system can't keep up
Hello,
We are running a statistical application called SAS on AIX 5.3 connected to a cx380 with 8GB cache on each SP. There is a temporary/work file system called /saswork. Users kick off huge queries and much of the sorting etc. is done in /saswork. Currently this file system sits on a 3+1 RAID 5. At times the file system does over 1200 IOPS, the drives cannot keep up, cache gets flooded, and I end up with way too many forced flushes. This seems to be impacting other applications that share the array.
If each of my 146GB 10K drives can do an average of 120 IOPS then I should have at least 10 drives to handle 1200 IOPS without flooding cache.
The file system is made up of 4 LUNs but LUN16 gets most of the IOs. This is the first 100 GB LUN in the file system and it seems like most jobs can get all their work done on this first LUN without needing the others. This is by far the busiest LUN on the array and it tends to bog down SPA. When I redo this I will use AIX Logical Volume striping so that IOs will be spread equally across all four LUNs - 2 on SPA and 2 on SPB.
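To illustrate why striping should even out the load, here is a minimal sketch of a round-robin stripe mapping. It is a simplified model, not how the AIX LVM is actually implemented; the 64 KiB strip size and the function names are assumptions made for the example.

```python
def lun_for_offset(offset_bytes, strip_size_bytes, lun_count):
    """Return the index of the LUN that holds the given logical offset."""
    strip_index = offset_bytes // strip_size_bytes
    return strip_index % lun_count


def io_distribution(offsets, strip_size_bytes, lun_count):
    """Count how many I/Os land on each LUN for a list of logical offsets."""
    counts = [0] * lun_count
    for offset in offsets:
        counts[lun_for_offset(offset, strip_size_bytes, lun_count)] += 1
    return counts


STRIP = 64 * 1024  # assumed strip size, not taken from the thread
offsets = [i * STRIP for i in range(1000)]  # 1000 sequential strip-sized I/Os
print(io_distribution(offsets, STRIP, 4))  # prints [250, 250, 250, 250]
```

Without striping, the same 1000 I/Os would all land on the first LUN until it filled up, which matches the hot LUN16 described above.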
I have 12 available 146GB drives (also have thirty 300GB drives but don't know if I want to use those big guys as I don't need much space). Should I do a 10+1 RAID 5? That seems like a lot of drives. Should I do two separate 5+1 RAID groups? If I get a few more drives maybe I could do two 6+1 or 7+1.
I have never used multiple RGs for one file system. Are there performance impacts? I could have LUN1 and LUN3 in the first RG and LUN2 and LUN4 could be in the second. As heavy writes are occurring they would be going to all four LUNs in both RGs. Does this make sense or should I stick to a single RG?
Thanks
Brad
RRR
6 Operator
•
5.7K Posts
0
July 15th, 2008 02:00
Correct. Bear in mind that a write I/O has a penalty of 4 in RAID5, so if your 1200 IOps are 75% read and 25% write, that would actually mean 900 + (4 x 300) = 2100 IOps at the disks! 2100 IOps means you'll need 17.5 (so 18) drives in RAID5 to handle this. But that's an example, since I don't know the exact R/W ratio. You can enable statistics logging in Navisphere and look at each LUN, but you can also try to do some logging on the host to find out what the ratio is.
Performance-wise it's a best practice to do RAID5 (4+1), but you'll end up with 2 unused drives, and the ones you do use still cannot handle the amount of I/O's. Ever considered RAID10 (6+6)? RAID10 has the advantage that the write penalty is only 2 instead of 4, so in my example of 75R/25W that would mean there are 900 + (2 x 300) = 1500 IOps. 1500 / 120 = 12.5 drives needed, so 14 actually, since 13 isn't allowed in RAID10.
Another way would be to create striped metaLUN's with each meta having components on each RG. This way the cumulative performance of all RG's will benefit all LUN's. So in your case with for example 2 RG's: create LUN's of half the size of LUN's 1 to 4 on RG1 and the same on RG2, and then create striped metas where each LUN has one component on RG1 and the other on RG2. I would "interleave" the primary component regarding the RG's. I mean: pri LUN1 is on RG1, pri LUN2 is on RG2, pri LUN3 is on RG1 and pri LUN4 is on RG2.
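The arithmetic above can be sketched in a few lines of Python. This is only a rough model under the figures used in this thread (write penalty of 4 for RAID5 and 2 for RAID10, 120 IOps per 10k drive, 75% reads); the function names are made up for the example.

```python
import math

# Write penalties quoted in the thread: 4 for RAID5, 2 for RAID10.
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def backend_iops(host_iops, read_pct, raid_level):
    """Disk-side IOps for a given host load and read percentage."""
    reads = host_iops * read_pct / 100
    writes = host_iops * (100 - read_pct) / 100
    return reads + WRITE_PENALTY[raid_level] * writes


def min_drives(host_iops, read_pct, raid_level, iops_per_drive=120):
    """Smallest drive count that covers the back-end load."""
    drives = math.ceil(backend_iops(host_iops, read_pct, raid_level) / iops_per_drive)
    if raid_level == "raid10" and drives % 2:
        drives += 1  # RAID10 needs an even number of drives
    return drives


for level in ("raid5", "raid10"):
    print(level, backend_iops(1200, 75, level), min_drives(1200, 75, level))
# prints: raid5 2100.0 18
#         raid10 1500.0 14
```

Changing `read_pct` once the real R/W ratio is known shows quickly how sensitive the drive count is to the share of writes.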
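As a planning aid, the interleaving of primary components can be written down as a small sketch. It only produces a layout table; it does not talk to the array, and the LUN and RG names are placeholders.

```python
def metalun_plan(lun_names, raid_groups):
    """Map each metaLUN to its component RGs, primary component first.

    The primary RG rotates from one LUN to the next so that no single
    RG holds all the primary components.
    """
    n = len(raid_groups)
    plan = {}
    for i, name in enumerate(lun_names):
        plan[name] = [raid_groups[(i + k) % n] for k in range(n)]
    return plan


# Placeholder names for the four LUNs and two RGs discussed above.
plan = metalun_plan(["LUN1", "LUN2", "LUN3", "LUN4"], ["RG1", "RG2"])
for lun, components in plan.items():
    print(lun, "primary on", components[0], "then", components[1:])
```

With two RGs this gives LUN1 and LUN3 a primary on RG1 and LUN2 and LUN4 a primary on RG2, which is the interleave described above.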
RRR
6 Operator
•
5.7K Posts
0
July 15th, 2008 02:00
Another advantage of having multiple RG's is that you can survive 2 simultaneous disk failures as long as they are in different RG's, and a single disk failure will only have an impact on the RG it is in. Besides: rebuilding a RAID5 (4+1) is faster than doing it in RAID5 (8+1).
RAID10 however is even faster, since rebuilding means mirroring and not really rebuilding the data from XOR's. And in RAID10 (6+6) you can survive up to 6 disk failures, as long as no two of them are in the same RAID1 (RAID10 is a striped set of multiple RAID1's).
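The RAID10 failure rule can be checked with a short sketch. The disk numbering and the pairing of disk i with disk i + 6 are assumptions made for the example; the real pairing depends on the order in which the disks were added to the RG.

```python
def raid10_survives(failed_disks, mirror_pairs):
    """True if no mirror pair has lost both of its members."""
    failed = set(failed_disks)
    return not any(a in failed and b in failed for a, b in mirror_pairs)


# Hypothetical 6+6 layout: disk i is mirrored by disk i + 6.
PAIRS = [(i, i + 6) for i in range(6)]

print(raid10_survives([0, 1, 2, 3, 4, 5], PAIRS))  # True: one failure per pair
print(raid10_survives([0, 6], PAIRS))              # False: both halves of one pair
```

So six failures is the best case, while two failures in the wrong place are already enough to lose the RG.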
dynamox
11 Legend
•
20.4K Posts
•
87.4K Points
0
July 15th, 2008 04:00
RRR
6 Operator
•
5.7K Posts
0
July 15th, 2008 04:00
RAID5 LUN's, half the size of what's needed per LUN per RG; stripe each volume over 2 LUN's on the host
If however the RAID10 (6+6) can do the job, I'd prefer that one !
brad12341
207 Posts
0
July 15th, 2008 10:00
The RAID 5 must be encountering a significant penalty. Maybe I'm better off with the RAID 10 6+6. It just seems like it will only use the power of 6 spindles and then mirror that to another six. Am I really getting the performance of 12 spindles or just 6?
I may be able to get 4 more drives later. Can I simply add/expand the 4 drives to my 6+6 without any problems? This would give me an 8+8.
Thanks - Brad
brad12341
207 Posts
0
July 15th, 2008 10:00
It seems like I may need all these drives because of the amount of writes.
RRR
6 Operator
•
5.7K Posts
0
July 16th, 2008 00:00
RAID5: (1200 x 45%) + 4 x (1200 x 55%) = 540 + (4 x 660) = 540 + 2640 = 3180 IOps
RAID10: (1200 x 45%) + 2 x (1200 x 55%) = 540 + (2 x 660) = 540 + 1320 = 1860 IOps
Calculating with 120 IOps per 10k drive (130 or 140 are also mentioned, but let's keep on the safe side with the 120) that would mean you need:
RAID5: 3180 / 120 = 26.5 drives --> round this up to the next integer which matches your RAID5 layout (4+1, 8+1, whatever)
RAID10: 1860 / 120 = 15.5 = 16 (8+8)
If you had 15k drives at 170 IOps:
RAID5: 3180 / 170 = 18.7 drives --> round this up to the next integer which matches your RAID5 layout, for example 4 x (4+1) = 20 spindles
RAID10: 1860 / 170 = 10.9 = 12 (6+6)
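The rounding step above can be sketched as follows. It follows the same simplification as the numbers in this post (every drive in the group counts for its full IOps figure), and the function name is made up for the example.

```python
import math


def drives_for_layout(backend_iops, iops_per_drive, group_size):
    """Round the raw drive count up to a whole number of RAID groups.

    group_size is the total drive count of one group, e.g. 5 for a
    RAID5 (4+1), 9 for a RAID5 (8+1), 2 for one RAID1 pair in RAID10.
    """
    raw = math.ceil(backend_iops / iops_per_drive)
    groups = math.ceil(raw / group_size)
    return groups * group_size


# Back-end IOps figures taken from the calculation above.
print(drives_for_layout(3180, 120, 5))  # RAID5 (4+1), 10k drives  -> 30
print(drives_for_layout(3180, 120, 9))  # RAID5 (8+1), 10k drives  -> 27
print(drives_for_layout(1860, 120, 2))  # RAID10, 10k drives       -> 16
print(drives_for_layout(3180, 170, 5))  # RAID5 (4+1), 15k drives  -> 20
print(drives_for_layout(1860, 170, 2))  # RAID10, 15k drives       -> 12
```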
If you start with a 6+6 now and want to expand the RG with extra drives, that's possible. The existing LUN's will be "spread" out over all drives and you'll benefit from the extra performance as soon as the RG expansion is done.
But please don't think you can use the extra space and keep the performance, since you then need to expand a LUN (create a meta) and where do you want to get the extra space from ? The same RG ? Best practice for a meta is to get the components from different RG's.
RRR
6 Operator
•
5.7K Posts
0
July 16th, 2008 05:00
There was a similar thread a few months ago as well.
The difficult part is to measure the number of IOps the host is generating over a certain amount of time. The outcome of that is your starting point. If the measurement is wrong, you'll end up buying too many or too few disks.... which in a way is good for EMC and Stefano's new car...
dynamox
11 Legend
•
20.4K Posts
•
87.4K Points
0
July 16th, 2008 05:00
brad12341
207 Posts
0
July 16th, 2008 08:00
One thing that I have learned with this issue is that it is better to do the striped meta LUNs to a different RAID group. I didn't really know that before and I'm not sure why a second RG is better. That said I will likely create smaller LUNs on my new RAID 1/0.
If I need more spindles or space later I will create similar smaller LUNs in a new RAID 1/0. I will then do a striped meta expansion of the original smaller LUNs with the new RG LUNs. Each individual LUN would then have the original component on the first RG and the expanded component on the second RG. It sounds like this would bring more spindles into play for each LUN.
Sound good?
RRR
6 Operator
•
5.7K Posts
0
July 18th, 2008 02:00
If you measured at the array, you will see the actual IOps to disk, so that's the value to calculate with. Each host I/O is translated into Clariion I/O's and the IOps per disk is a known value, so gentlepeople, start your math!
kelleg
6 Operator
•
4.5K Posts
0
July 18th, 2008 08:00
The reason has to do with striping the metaLUN within the same RAID Group: you get vertically striped patterns on the physical disks, so when you write to the meta, it must access the same disks over and over to write the data. If you use different RAID Groups, you get the benefit of overlapped seeks and writing to more physical disks.
glen
brad12341
207 Posts
0
July 18th, 2008 14:00
Hmmm, I'd rather not use those drives for this busy RG. I do have 30 300GB drives available. Earlier I had preferred the 146GB drives as I did not need all the space on the 300GB drives. If I use 12 300GB drives I will be burning over 3TB of disk to get my 400GB file system. Now because of the vault drives issue I may need to use the 300GB.
This gives me some flexibility because now I'm not limited to only 12 drives. I could create two 6+6 RGs and stripe expand the luns in the first RG to the luns in the second RG using the 300GB drives.
I believe the 146 and 300 GB drives are both 10K. Will the response time be the same for either drive size? Or will the 300GB drives be slower for some reason? If I made one RG with 146GB drives and stripe expanded to luns in a 300GB RG - would that matter?
Thanks again.
AranH1
2.2K Posts
1
July 18th, 2008 15:00
The performance of 10k drives will be roughly the same regardless of the capacity. And I agree that you should not use the vault drives for the RG, since you have high performance requirements and you don't want to impact the performance of the array itself as well as your application. Beyond that, since the space available on each vault drive is reduced by the space allocated to vault data, the capacity of each drive in the RG will be limited as well.
I don't see a problem with taking LUNs from a 146GB 10k RG and striping them with LUNs from a 300GB 10k RG, since the performance metrics will be the same. As you pointed out there will be a lot of wasted space (assuming you don't want any other LUNs for other hosts in these RGs), but if that is acceptable to provide the performance configuration for your application then that is the cost, right?