
August 12th, 2010 06:00

MetaLUN Performance Questions

Hoping someone with performance training can answer these questions for us.

OK, so let's assume I have 5 RAID 10 (4+4) RAID groups built from 450 GB (400 GB formatted) spindles.

Let's now split each RG into 5 LUNs (the numbers above come from our capacity planning).

Let's now make a MetaLUN out of each set of these splits, taking one LUN from each RG.

So each LUN will be (400 x 4) / 5 = 320 GB.

Each MetaLUN will be 320 GB x 5 = 1.6 TB.
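Just to sanity-check my own arithmetic, a quick throwaway sketch in Python - nothing CLARiiON-specific, only the figures above:

# quick sanity check of the capacity figures above (illustration only)
FORMATTED_GB  = 400   # usable capacity of each 450 GB spindle
DATA_SPINDLES = 4     # data half of a 4+4 RAID 10 group
LUNS_PER_RG   = 5     # how many LUNs we bind per RAID group
COMPONENTS    = 5     # one LUN taken from each of the 5 RAID groups

lun_gb  = FORMATTED_GB * DATA_SPINDLES / LUNS_PER_RG   # 320.0 GB per LUN
meta_gb = lun_gb * COMPONENTS                          # 1600.0 GB = 1.6 TB per MetaLUN
print(lun_gb, meta_gb)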

Now I write 64 KB to the MetaLUN. With the setup above, do I end up with data traversing components (i.e. misalignment)?

Will this 64 KB be striped across the MetaLUN members, and furthermore across the RAID group members, so that you end up with 64/5 going to each MetaLUN member and 64/5/5 to each spindle?

I know that even numbers of components (e.g. 4 or 8) will probably give me better alignment (because of the 64 KB element size on the CLARiiON), but I'm asking because (a) for capacity reasons I cannot do that, and (b) I would like to better understand how data is actually laid down on the disks.

Thanks in advance for any help.

4.5K Posts

August 13th, 2010 12:00

With a 4+4 RAID 1/0, the stripe size is 256 KB - you don't count the secondary (mirror) disks, so it's 64 KB for each of the 4 primary disks (64 KB * 4).

The stripe size of the first metaLUN component LUN is therefore 256 KB. You then multiply that component LUN's stripe size (256 KB) by the metaLUN stripe multiplier (4 is the default) = 1 MB.

So when you write to the metaLUN, you write 1 MB into the first metaLUN component, then move to the second component LUN and write 1 MB, then to the third, fourth and fifth. Each component in the metaLUN gets 1 MB before moving to the next component LUN.

Within the component LUN, the 1 MB of data is still written in 64 KB chunks to each disk - so the 1 MB will be distributed across the four data disks in 64 KB chunks, four times around.

I can't tell you where each block of data will reside on each individual disk, as that is determined by the host filesystem. If you write sequential data into the LUN one 64 KB chunk at a time, the first 64 KB goes to the first disk of the first component LUN. If you then write a second 64 KB, it should be the first 64 KB on the second disk in the first component LUN. This continues until you reach 1 MB, at which point you start writing into the second component LUN.
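If it helps, here is a rough sketch of that mapping - my own toy model, not FLARE code - assuming the default 64 KB element, 4 data disks per 4+4 RAID 1/0 component, the default stripe multiplier of 4, and your 5 component LUNs; the locate() helper is purely for illustration:

# Toy model of the layout described above (assumption: defaults everywhere).
ELEMENT    = 64 * 1024          # bytes written per disk before moving on
DATA_DISKS = 4                  # primary disks in a 4+4 RAID 1/0
MULTIPLIER = 4                  # metaLUN stripe multiplier (default)
COMPONENTS = 5                  # component LUNs in the metaLUN

STRIPE  = ELEMENT * DATA_DISKS  # 256 KB component stripe
SEGMENT = STRIPE * MULTIPLIER   # 1 MB written per component before rotating

def locate(offset):
    """Map a logical metaLUN byte offset to (component LUN, data disk)."""
    segment   = offset // SEGMENT
    component = segment % COMPONENTS
    # offset within the component LUN's own address space
    lun_off   = (segment // COMPONENTS) * SEGMENT + offset % SEGMENT
    disk      = (lun_off // ELEMENT) % DATA_DISKS
    return component, disk

print(locate(0))            # (0, 0) - a single 64 KB write lands on one disk of component 0
print(locate(64 * 1024))    # (0, 1) - next 64 KB, same component, next disk
print(locate(1024 * 1024))  # (1, 0) - after 1 MB we rotate to component 1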

Does this help?

392 Posts

August 12th, 2010 07:00

From a purely performance standpoint, you'd want to use 4 RAID groups to ensure you have a 1024 KB MetaLUN stripe. If you're on Windows 2K8 R2, alignment will be handled by the host.

How that 64 KB gets written will depend on the I/O. Fully cached, it could be written out as a full stripe write. Worst case, it's a read-modify-write to a single RAID group.

The EMC CLARiiON MetaLUNs paper has recently been rewritten to include performance and availability information; you might want to review it. It's available on Powerlink.

44 Posts

August 12th, 2010 07:00

Thank you for your answer, but as stated in my question, for capacity reasons I have to go with this configuration. That is why I need to better understand how the data is laid out on the disks when MetaLUNs are used.

I will search for the document you pointed to. I have the 'old' MetaLUN paper and that doesn't answer my questions.

4.5K Posts

August 12th, 2010 08:00

The document is located in the Document section of the forum:

White Paper EMC CLARiiON MetaLUNs - A Detailed Review.pdf

You want to see the section about the metaLUN stripe: the amount of data written to each disk depends on the number of component LUNs, the stripe size, and the metaLUN stripe multiplier that you choose. It's a bit complicated.

glen

131 Posts

August 12th, 2010 12:00

I might be wrong on this, but I'm pretty sure I'm right.  I'm sure someone will correct me.

Provided you aren't doing huge writes and don't have write cache disabled, all writes will go to cache first, not straight to disk.

64 K is what it writes onto each disk. In a perfect world with no caching, a 64 K write would go to 2 disks in RAID group 1 (1 disk + its mirror), and the next 64 K write would go to the next pair of disks in the same RAID group. Once all the disks in the RG have been written to, it would move on to the next component LUN of the MetaLUN.

With caching in the picture...

All the writes to the LUN go to the cache. The cache fills up cache pages with these writes (sequential writes), then dumps them to disk as larger writes (larger writes mean fewer IOPS).

The CLARiiON is all about 64 K. It writes 64 K before moving to the next disk, even if you're doing a bunch of 8 K writes. With cache, it gathers them all up so it can send 64 K to that disk (and most likely multiple 64 K writes to multiple disks - hopefully full stripe writes). Without cache it would do eight 8 K writes to the disk before moving to the next disk.
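Back-of-the-envelope only (my own made-up numbers, nothing measured), here's the kind of difference coalescing makes for a run of sequential 8 K writes; mirror writes are ignored to keep the comparison simple:

# Rough illustration of cache coalescing, assuming sequential 8 KB host writes
# and that every full 64 KB element gets merged into one back-end write.
HOST_IO    = 8 * 1024       # size of each host write
ELEMENT    = 64 * 1024      # per-disk chunk the array wants to lay down
NUM_WRITES = 1024           # 1024 x 8 KB = 8 MB of sequential data

without_cache = NUM_WRITES                          # one back-end write per host write
with_cache    = NUM_WRITES * HOST_IO // ELEMENT     # one back-end write per full element

print(without_cache, with_cache)   # 1024 vs 128 back-end writes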

Not sure if I answered your question.  If I didn't feel free to yell at me.

44 Posts

August 13th, 2010 02:00

Thanks to everyone who's had a stab at answering this so far.

I am doing more reading because so far my question has not been answered.

The 64 K write chunk is not always true; it depends on the element size. If you leave the element size at the default of 128 blocks, each of 512 bytes, then you get your 64 K.

So the 64 K can change.

Cache coalescing is also something that happens in the background. The cache attempts, and does not always succeed, to coalesce several writes into one full stripe width and then do one write to the disks instead of many. Again, this is not always successful, but the CLARiiON does a very good job of it.
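As a tiny illustration of the element-size point (the 256-block value below is just a hypothetical alternative setting, not a recommendation):

# element size is set in 512-byte blocks at bind time; 128 blocks = the usual 64 KB
BLOCK = 512
for element_blocks in (128, 256):        # 128 is the default; 256 is hypothetical
    print(element_blocks, "blocks =", element_blocks * BLOCK // 1024, "KB per disk")
# 128 blocks = 64 KB per disk, 256 blocks = 128 KB per disk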

My question acknowledged that, in a perfect world, a number of components that divides evenly into the 64 K element (since we leave everything at defaults) would work better; however, I explained why we chose the config we chose.

Not sure I was clear enough on the config, so let me try again (see the attached jpg):

MetaLUN.jpg

4+4 RAID 10 RAID group of 400 GB drives = 1.6 TB (only showing the data drives above).

Bind 5 LUNs from this RAID group, so we get 5 x 320 GB LUNs (LUN1 to LUN5).

Now do the above on 4 more RGs, so we end up with a total of 5 RGs; this gives us 25 x 320 GB LUNs.

Select a LUN from each RG and make them into a MetaLUN, so we end up with 5 MetaLUNs; each MetaLUN is 5 x 320 GB = 1.6 TB.

Now disable all caching (read and write) for each LUN, and therefore for the MetaLUN as well.

I now write my 64 K to MetaLUN1. Does this write go to each of the LUNs making up MetaLUN1? If so, since I have 5 members making up this MetaLUN and 64/5 is not a whole number, do I end up traversing LUNs in the MetaLUN for this write? I.e., am I ending up doing more writes than if I had a MetaLUN made up of members that divide evenly into 64 K?

Please note this question is not about how to design a good implementation, but rather about how writes land on MetaLUNs with LUN read and write caching disabled and the element size at default.

2 Intern • 5.7K Posts

August 13th, 2010 03:00

One remark on your LUNs: create each META head on the next RAID group. This way, the load at the beginning of each META LUN lands on a different RG for each META LUN - a sort of rotational META LUN setup, if you know what I mean (there's a small sketch after the list as well).

So:

component 1 of META 1 starts on RG 1

component 1 of META 2 starts on RG 2

component 1 of META 3 starts on RG 3

component 2 of META 1 starts on RG 2

component 2 of META 2 starts on RG 3

component 2 of META 3 starts on RG 1

component 3 of META 1 starts on RG 3

and so on....
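Spelled out for your 5 RGs and 5 METAs - my own little sketch; the formula is just how I'd express the rotation, not anything from Navisphere:

# rotational META LUN layout: component c of META m starts on RG ((m + c - 2) % NUM_RGS) + 1
NUM_RGS, NUM_METAS, NUM_COMPONENTS = 5, 5, 5

for meta in range(1, NUM_METAS + 1):
    rgs = [((meta + comp - 2) % NUM_RGS) + 1 for comp in range(1, NUM_COMPONENTS + 1)]
    print("META", meta, "components on RGs", rgs)

# META 1: [1, 2, 3, 4, 5]
# META 2: [2, 3, 4, 5, 1]
# ... so each META's head (busiest) component lands on a different RG.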

2 Intern • 5.7K Posts

August 13th, 2010 04:00

You're welcome

44 Posts

August 13th, 2010 04:00

RRR, I like that idea. I think I'll use it. Hope you don't want any money for it!

4.5K Posts

August 13th, 2010 08:00

In the white paper, see the top of page 20 - it explains how to figure out the amount of data written to each component LUN in a metaLUN.

Click on the image below - sorry about the quality

metalun_stripe.bmp

glen

44 Posts

August 13th, 2010 09:00

Thank you, that was a helpful answer, but I had already downloaded that paper and read it in its entirety (quite a long read, may I add).

That tells me the stripe width; unfortunately, I'm after a little more detail than that, or perhaps I don't understand it fully.

In my diagram, I have 5 RGs, each with 5 LUNs. Since each RG is a 4+4, let's call the pieces slices (s): s1 of RG1-LUN1, s2 of RG1-LUN1, etc. Adopting this terminology, RG1 has s1RG1L1, s2RG1L1, s3RG1L1 and s4RG1L1 making up LUN1, through s1RG1L5, s2RG1L5, s3RG1L5 and s4RG1L5, and the last RG, RG5, would hold the last LUN as s1RG5L25 to s4RG5L25.

Looking at my diagram again, MetaLUN1 would be made up of s1RG1L1, s1RG2L6, s1RG3L11, s1RG4L16 and s1RG5L21.

When I write my 64 K file, does it only write to s1RG1L1, or does it write 64 K/5 to each component?

If it writes the 64 K to s1RG1L1 only, what happens on the next write? Assuming it is, say, 75 K, does it write another 64 K, this time to s1RG2L6, and 11 K to s1RG3L11?

If it does distribute the first 64 K across all components, does it write 64 K/5 to each of the components (slices) in the MetaLUN?

I'm not even convinced that with the arrangement shown in the diagram I'm actually going to gain anything in the way of performance.

Hope you can see what I'm trying to get answered.

44 Posts

August 14th, 2010 06:00

Thank you Kelleg, I have marked this as the 'correct answer' because it answers my question.

Looks like with 5 components in the metaLUN we will end up with data traversing the components if no cache coalescing is in effect (i.e. if I disable write caching for the components). The underlying RGs making up the metaLUN components are 4+4, but there are 5 metaLUN components, NOT 4.

I think I have to come up with a better design that still satisfies our capacity requirements, or perhaps play with the metaLUN stripe multiplier.

Back to the drawing board!
