
August 7th, 2012 08:00

Ask the Expert: Performance Calculations on Clariion/VNX

Performance calculations on the CLARiiON/VNX with RRR & Jon Klaus

 

Welcome to the EMC Support Community Ask the Expert conversation. This is an opportunity to learn about performance calculations on the CLARiiON/VNX systems and the various considerations that must be taken into account.

 

This discussion begins on Monday, August 13th. Get ready by bookmarking this page or signing up for email notifications.

 

Your hosts:

 


 

Rob Koper has been working in the IT industry since 1994 and for Open Line Consultancy since 2004. He started out with the CLARiiON CX300 and DMX-2 and has worked with all newer arrays since, up to current technologies like the VNX 5700 and the larger DMX-4 and VMAX 20k systems. He's mainly involved in managing and migrating data to storage arrays over large Cisco and Brocade SANs that span multiple sites spread across the Netherlands. Since 2007 he has been an active member on ECN and the Support Forums, and he currently holds Proven Professional certifications such as Implementation Engineer for VNX, CLARiiON (Expert) and Symmetrix, as well as Technology Architect for CLARiiON and Symmetrix.

 


Jon Klaus has been working at Open Line since 2008 as a project consultant on various storage and server virtualization projects. In preparation for these projects, an intensive one-year barrage of courses on CLARiiON and Celerra yielded him the EMCTAe and EMCIEe certifications on CLARiiON and EMCIE + EMCTA status on Celerra.

Currently Jon is contracted by a large multinational and is part of a team responsible for running and maintaining several (EMC) storage and backup systems throughout Europe. Among his day-to-day activities are performance troubleshooting, storage migrations and designing a new architecture for the European storage and backup environment.

 

This event ran from the 13th until the 31st of August.

Here is a summary document of the highlights of that discussion, as set out by the experts: Ask The Expert: Performance Calculations on Clariion/VNX wrap up

 

 

The discussion itself follows below.

1K Posts

August 29th, 2012 12:00

FAST Cache should be disabled on the online redo logs. Redo logs are small-block sequential writes; FAST Cache will not help.

67 Posts

August 29th, 2012 12:00

I was curious, so I checked it out here: http://www.emc.com/collateral/software/white-papers/h8018-fast-cache-oracle-wp.pdf

It is not enabled in any of the use cases there, and the conclusion says to disable FAST Cache on LUNs where online redo logs reside.

Cheers,

Victor

247 Posts

August 29th, 2012 13:00

Okay, so has everyone got their scuba gear on?! We're going down into the world of COFW!

The below example is for CLARiiONs running R26+.

Let's say we've got a LUN that has 500 random 4kB IOps going to it, in a ratio of 3:1 R/W. What will be going to the RLP (the Reserved LUN Pool)?

First of all, only writes will trigger a COFW. To be even more specific, only the first write to a block will trigger a copy to RLP (it's called Copy on FIRST Write for a reason ). Since the LUN is getting 100% random I/O, we can safely assume that ALL writes will trigger a COFW.

So, 125 write IOps will trigger a COFW. Now, what happens with a COFW:

- The block is read from the source LUN. This doesn't impact the RLP, so we don't care in this calculation.

- The bitmap is updated in the reserved LUN. This is one write, usually 8kB, worst case 64kB.

- The chunk index and status area (= metadata) is updated. This is also one write, usually 8kB, worst case 64kB.

- The actual block is stored in the RLP. This is a 64kB write: simple!

Usually these bitmaps and status areas are paged in and out of SP memory. In real life, most of these map hits will be in memory. But since assumptions can come back to haunt you, let's assume the worst: all are going straight to disk.

Okay, so we've got 3 writes going to disk per COFW. We already knew that we have 125 writes triggering a COFW. This brings us to a grand total of 375 writes to the reserved LUN. This is "front-end" I/O, not disk I/O yet! So how many disks will you need for a RAID10 RLP?

The RAID10 write penalty is always 2. With 375 I/Os going to the RLP, we're going to need 750 IOps worth of disks to fulfill this requirement. Which, from a R10 15k rpm FC perspective (~180 IOps per drive), is about 4,2 drives. Since EMC sells only whole drives, a group of 4 drives will just about do it.
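If you want to play with these numbers yourself, here's a minimal sketch of that RAID10 sizing, assuming the ~180 IOps per 15k rpm FC drive rating used elsewhere in this thread (adjust to your own drive rating):

```python
# Minimal sketch of the RAID10 RLP sizing above.
# Assumptions: 3 writes per COFW (bitmap + metadata + 64 kB chunk copy),
# ~180 IOps per 15k rpm FC drive (the rating implied by this thread).

source_write_iops = 125          # 500 IOps at 3:1 R/W -> 125 writes/s
writes_per_cofw   = 3            # bitmap update + metadata update + chunk copy
raid10_wp         = 2            # RAID10 write penalty
iops_per_drive    = 180          # assumed 15k rpm FC drive rating

rlp_frontend_iops = source_write_iops * writes_per_cofw   # 375 front-end writes to the RLP
rlp_backend_iops  = rlp_frontend_iops * raid10_wp         # 750 disk IOps
drives_needed     = rlp_backend_iops / iops_per_drive     # ~4.2 drives

print(rlp_frontend_iops, rlp_backend_iops, round(drives_needed, 1))
```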

Easy, right? Let's go a bit deeper.

The above example works with both single sessions on a snap and with multiple sessions. This is because RAID10 has a WP of 2 for sequential and random writes; it doesn't matter. For RAID5 it does matter though!

Let's go back to the above example, a single session on the snap, and working with RAID5 in a 4+1 group. We've got 125 writes triggering a COFW. Each COFW triggers, on the reserved LUN:

- A bitmap update. This is always random.

- A metadata update. This is sequential for a single session: we can start writing at the beginning of that area and just keep on writing through for each consecutive COFW.

- A COFW. Again, for a single session we can just start at the beginning of the reserved area for COFW blocks and keep on writing.

So we have one random write to the RAID5 reserved LUN -> WP = 4 right?

We also have two sequential writes to the RAID5 reserved LUN. The WP for sequential writes to a 4+1R5 group is 1,25.

Add everything and you'll end up with a combined WP of 6,5 (4+1,25+1,25) for each COFW. So the 125 writes will trigger 813 write IOPs to the RLP. Which, in terms of 15k rpm FC drives, will cost you 4,5 drives. Round it up and you'll end up neatly with 5 drives, which is your 4+1 group. So we're already more expensive than RAID10!
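The same single-session RAID5 math as a quick sketch, under the same assumptions (125 COFW-triggering writes/s, ~180 IOps per 15k rpm FC drive):

```python
# Single snap session on a 4+1 RAID5 reserved LUN.
cofw_per_sec  = 125
wp_random     = 4            # the bitmap update is random
wp_sequential = 1 + 1/4      # metadata and chunk writes are sequential on 4+1 -> 1.25

combined_wp = wp_random + 2 * wp_sequential      # 4 + 1.25 + 1.25 = 6.5
rlp_iops    = cofw_per_sec * combined_wp         # 812.5 IOps (~813) to the RLP
print(rlp_iops, round(rlp_iops / 180, 1))        # -> 812.5 IOps, ~4.5 drives
```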

Now, for the finale: let's assume we want to run multiple sessions against that snap.

- The bitmap update is already random.

- The metadata update is still sequential.

- The place where we store the COFW chunk is now unknown. We can no longer just assume that we're filling the RL front to back. We might as well have made a couple of sessions and deleted the first one. That frees up space at the front of the RLP, so now we can write at the rear of the RLP and at the front. Where is it going? Thus -> random!

So now our combined R5 COFW write penalty will be: 4 + 4 + 1,25 = 9,25. Which, for a multi-session 125 IOps random write stream, will result in roughly 1156 IOps to the reserved LUN. You'll need about 6,4 15k rpm FC drives for this workload. Since it's not recommended to metaLUN a RL, you'd make a bigger RAID5 group, maybe 6+1. Now we have to double back: this bigger group has lowered your sequential R5 penalty from 1,25 (1 + 1/4 data drives) to 1 + 1/6 = 1,17. Let's calculate again: the combined R5 COFW penalty for a multi-session setup is 4 + 4 + 1,17 -> 9,17 WP -> 125 * 9,17 -> 1146 IOps to the reserved LUN. That is still about 6,4 15k rpm FC drives, so the 7 drive RL is good to go.
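And a sketch for the multi-session case, with the group size as a parameter so you can see the 4+1 vs 6+1 difference (same ~180 IOps per drive assumption):

```python
# Multiple snap sessions: bitmap and chunk writes are random, only the
# metadata update stays sequential. Assumes ~180 IOps per 15k rpm FC drive.

def multi_session_rlp_iops(cofw_per_sec, data_drives):
    wp_random     = 4
    wp_sequential = 1 + 1 / data_drives            # 1.25 for 4+1, ~1.17 for 6+1
    combined_wp   = 2 * wp_random + wp_sequential  # bitmap + chunk random, metadata sequential
    return cofw_per_sec * combined_wp

for data_drives in (4, 6):
    iops = multi_session_rlp_iops(125, data_drives)
    print(data_drives, round(iops), round(iops / 180, 1))   # -> ~1156 / 6.4 and ~1146 / 6.4
```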

So that's IOps! But we're not there yet... what about bandwidth?! We're writing 4kB blocks, but a COFW is always 64kB. So those 125 write IOps to the source LUN are only 500kB/s in bandwidth, but the bandwidth to the reserved LUN is actually 125 * 64kB/s = 8MB/s. A factor of 16x!
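The same bandwidth amplification spelled out, using only the numbers from the paragraph above:

```python
write_iops  = 125
source_kbps = write_iops * 4     # 4 kB host writes -> 500 kB/s to the source LUN
rlp_kbps    = write_iops * 64    # 64 kB COFW chunks -> 8000 kB/s (8 MB/s) to the RLP
print(rlp_kbps / source_kbps)    # -> 16.0, the 16x amplification factor
```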

Moral of the story: put your RLP on RAID10! It saves you a mountain of calculations and your performance doesn't drop like a stone once someone decides to use snapshots more intensively with multiple sessions.

5.7K Posts

August 30th, 2012 00:00

Wow, Jon! This is a very good post on IOps calculations when used with snapshots!!! One thing puzzles me though, and that's the bitmap. It's called a BITmap, not an 8 kB map, for a reason, right? Wrong? In the bitmap only representations of the changed blocks are kept, right? So a "1" if a block changed and still a "0" if the source LUN still has its original data from before the snapshot. So if a source LUN block changed, I always assumed only a single bit needs to be flipped. I know for a fact that I asked this very same question to my CLARiiON Performance Workshop instructor, and I'm pretty sure the answer was that bitmap changes don't need to be taken into account. But that was in the old days with FLARE 14 or 16...

I'm curious why nowadays bitmap changes are 8 kB minimum and 64 kB maximum. Can you please explain this?

247 Posts

August 30th, 2012 00:00

Thanks Rob!

To be perfectly honest, I can't give you a definite answer. I dragged these block sizes straight from the Business Continuity Design book; it has been quite a while since I did those calculations! For the first couple of minutes I was slightly puzzled about what was going on... long live incomprehensible notes, hah! "What's that x3 doing in here?!"

My best bet is that the bitmap is paged in and out of SP memory in at least 8kB blocks, or 64kB blocks if a lot more paging is needed. Which would make some sort of sense: if you're going to perform 1 COFW, you're not going to collect JUST that 1 bit from the drives, copy it to SP mem, flip it, and then at some time flush just that 1 bit back. You'd run into a mountain of issues once you need to do many COFWs.

Remember that the bitmap can get quite large. My notes told me 0,2% of the total LUN size -> if you have a 2TB LUN that's being snapped, you've got a 4GB bitmap. No way are you going to keep that in memory all the time. And remember, we're going for the worst case example here: your write cache/SP cache is 100% full, you're forced flushing, no writes are merged into full stripe writes... etc. etc.
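As a tiny sketch of that sizing rule of thumb (the 0,2% figure comes from the notes above, so treat it as an approximation):

```python
def snap_bitmap_gb(source_lun_gb, map_fraction=0.002):
    # roughly 0.2% of the source LUN size ends up as map data
    return source_lun_gb * map_fraction

print(snap_bitmap_gb(2048))   # 2 TB source LUN -> ~4 GB of bitmap, too big to pin in SP memory
```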

5.7K Posts

August 30th, 2012 01:00

That's certainly a good explanation, Jon. If in doubt, I'd vote for that one.

5.7K Posts

August 30th, 2012 03:00

Good one! Thanks for the addition!!

666 Posts

August 30th, 2012 03:00

We're in the final stretch for this discussion; it wraps up tomorrow. So be sure to tie off any questions you had, or if you have additional questions, get them in before the end of tomorrow. It's been a fantastic Ask the Expert event, with a Tweet Chat thrown in too. Well done to our experts and all who participated.

Regards,

Mark

5.7K Posts

August 30th, 2012 04:00

I’m sure there’s something wrong with the layout. I’m seeing the same thing. Also previously I could view the whole discussion in 1 view and now there are 8 pages

Mark will fix it, I’m sure

197 Posts

August 30th, 2012 05:00

Jon,


In the multiple-session example, if both sessions had yet to see a write to the same block, would it only be one COFW operation when that write occurred on the block? Basically, what I'm asking is: can sessions reference the same COFW data to save I/O plus space in the RLP?

5.7K Posts

August 30th, 2012 07:00

I'm pretty sure that in the old days (FLARE 14/16) a COFW took place for each session, so with 15 sessions active this could be a heavy burden. Nowadays I'm almost sure that a CX or VNX is much smarter, but I'll leave the answer to Jon (or others).

5.7K Posts

August 30th, 2012 12:00

Right, thanks for the update, Jon.

I guess there are a lot of things we can do to enhance performance, aren't there? Several, in fact!

Since we're near the end of our session, I just want to name a few, and hopefully we'll get some time to continue this conversation tomorrow, or else after this topic has been released into the wild ;-)

  1. Make sure not to put all your data on just a single LUN. More LUNs mean more I/O queues, so more concurrency
  2. Calculate what the max queue depth should be in your HBA and perhaps even your OS (VMware ESX for example); see the sketch after this list
  3. More zones / HBAs and thus paths from your host to your storage = more I/O queues, so more concurrency
  4. Try playing around with the cache settings in your storage array
  5. Decide whether or not you want your storage to be capable of handling peak I/O bursts. Not doing so is cheaper, but you will suffer longer from the performance impact
  6. Make sure you use the right RAID level for your data. Once again: do the math!
  7. Make sure the host is sized right. Put more RAM in, add faster or more CPUs, or spread your applications horizontally by adding more hosts to do some serious load balancing
  8. Still need more performance? Consider metaLUNs or storage pools, or even FAST VP, FAST Cache or a bigger array with more cache
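On point 2, a hedged sketch of the kind of queue-depth math involved. The per-port outstanding-I/O limit below is a placeholder assumption, not a quoted spec; look up the real figure for your array model and FLARE release before using it.

```python
# Rough per-LUN HBA queue depth sizing sketch. port_queue_limit is an assumed
# placeholder; check your array's documentation for the actual front-end port limit.

def max_queue_depth_per_lun(port_queue_limit, hosts_per_port, luns_per_host):
    # Divide the front-end port's outstanding-I/O budget over every host and
    # LUN sharing that port, so no single host can flood the queue.
    return port_queue_limit // (hosts_per_port * luns_per_host)

# Example: assumed limit of 1600 outstanding I/Os per SP port, 8 hosts, 10 LUNs each
print(max_queue_depth_per_lun(1600, 8, 10))   # -> 20
```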

And what about VFcache?

247 Posts

August 30th, 2012 12:00

Hi hersh,

Yes, chunks can be shared/used by multiple sessions. Of course, if you wait a long time between sessions, the chance that blocks are different is quite large, so the number of shared blocks goes down.

As for the write pattern. If you start one session, COFW chunks are written sequentially to the RLP. If you start another session, writes are still sequential. This pattern will continue until you stop or restart a session; then the pattern becomes (more) random.

--

Rob, I've checked the Performance Bible... Regarding COFW, I read the following: "Note that metadata (map) I/O's are mainly 8kB. Prior to R26 they were all 64kB in size."

247 Posts

August 30th, 2012 12:00

So, we're closing this session tomorrow. But we've got time for one insane calculation...

Let's assume that we have two CLARiiONs running MirrorView/S on FLARE29. We've got two primary LUNs on one system, 512GB each. The systems are connected with a 10Mbit dedicated line which has a 2:1 compression ratio. The primary LUNs have a constant workload of 400 IOps in total, 100% random, 8kB in size, R:W of 3:1.

Question: how long can we fracture the mirrors if we want them to resync in max 2 hours?

Let's start with the link. We've got 10Mbit. We can compress 2:1, so actually we can push 20Mbit across the line. Here comes surprise one: FC uses 10 bits for each byte. So instead of 20/8*1024=2560KB/s usable bandwidth, we actually only have 20/10*1024=2048KB/s bandwidth.

So let's think. If we fracture, we start piling up work that we need to sync once we resume mirroring. But the normal workload will also keep on coming, so we can't use the full 2048KB/s to sync up!

The normal workload will be 100IOps * 8kB = 800kB/s. Which means we have 2048-800= 1248kB/s to re-sync.

The customer allows us to re-sync for 2 hours, which means we can move 7200 seconds worth of data: 1248 kB/s * 7200 s = 8985600 kB (roughly 8,6 GB).

So now we only need to find out how much data we need to resync for each second we're fractured. Here comes the next surprise. Who's thinking 100 * 8 kB = 800 kB for each second? WRONG!

Remember the fracture log? It's limited in size, so we end up with the same issue as with COFW: we can't always offer the tracking granularity we would want. For R28 the fracture log (FL) is 32 kB, for R29 it's a whopping 256 kB. We're running R29, which means we have 2 Mbit of tracking bits (= 2097152 of them). Our source LUNs are 512GB (= 536870912 kB). Thus, each bit tracks 256 kB of data. We've got 100% random writes, so each write will taint a new 256 kB tracked extent. 100 write IOps of 8 kB therefore generate 100 * 256 kB = 25600 kB of data per second that needs to be synced.

Since we have a budget of 8985600 kB that we can sync in 2 hours, we can now calculate how long we can stay fractured: 8985600 / 25600 = about 351 seconds, not even 6 minutes! Not as much as you would have thought, right?!
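Here's the whole fracture-budget calculation as a small sketch, using only the numbers from this scenario; plug in 32 kB for the fracture log if you want to try the R28 exercise below.

```python
# MirrorView/S fracture time budget, per the scenario above.
link_kbps     = 20 * 1024 / 10       # 10 Mbit line, 2:1 compression, 10 bits per FC byte -> 2048 kB/s
workload_kbps = 100 * 8              # ongoing writes: 100 IOps * 8 kB = 800 kB/s
resync_kbps   = link_kbps - workload_kbps     # 1248 kB/s left over for the resync
budget_kb     = resync_kbps * 2 * 3600        # data we can move in the 2 hour window

lun_kb     = 512 * 1024 * 1024       # 512 GB source LUN, in kB
fl_bits    = 256 * 1024 * 8          # R29 fracture log: 256 kB of tracking bits
extent_kb  = lun_kb / fl_bits        # each bit tracks 256 kB
dirty_kbps = 100 * extent_kb         # worst case: every random write taints a fresh extent

print(budget_kb / dirty_kbps)        # -> ~351 seconds of allowed fracture time
```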

Calculate for yourself what your time budget would be with R28... if you ever need a good reason to follow the FLARE releases, here's one!

247 Posts

August 30th, 2012 13:00

So, we're almost out of time. These three weeks were awesome! Thanks for asking all your questions and allowing us to try and answer them. It certainly was a lot of fun and sometimes even a nice challenge for Rob and me...

Also, a big cheers to Mark for giving us this thread to play around in and starting the first ever EMC TweetChat. THANK YOU!

If you want to learn more about these kinds of calculations with EMC Education, I can highly recommend ANY training Stephen Stead gives, especially the Performance Workshop and the Business Continuity training. Believe me: I was EXHAUSTED after the Business Continuity training! The pace is high, you're challenged with good questions... hard work, but very rewarding!

Thanks all!

If you have any burning questions, get them in asap. And otherwise... see you on ECN or maybe on EMCWorld!!

Jon
