Benny4

17 Posts

2190

December 13th, 2011 07:00

Performance metrics @ Disk / LUN / RG / Meta

I wanted to open this topic to hopefully get a little more clarity around the area of performance monitoring on the Clariion.

These are the metrics that I was told to use as a benchmark in my performance tuning class:

Utilization = 70%

Throughput = 100

Bandwidth mbps = 10

Responsetime = 20

Avg Queue Length 12 or 2 per disk

Now when I look at this through Analyzer I see areas for concern at the LUN / Meta / RG level (high response times, high IO), but if I expand this and look at it disk by disk, things look good. Shouldn't Analyzer adjust the metrics based on the number of disks behind each LUN / Meta / RG? It can make for some really concerning metrics otherwise. Or is there a better tool to analyze this information?

Thanks

Responses(9)

kelleg

4.5K Posts

0

December 13th, 2011 13:00

When using Analyzer you will find that there's a lot of data. Some may be useful, some not so useful.

When you're monitoring performance, the real key will be "how do my hosts perform". You know you have performance issues when you get calls from your users in the middle of the night.

So, what should you look for? The first place I recommend looking is the LUN(s) for the host experiencing problems. See what the response times are. This is the best metric to quickly see if there is a problem. If there is a problem, then you can use all the other metrics on the array to try to determine why that particular LUN is slow.

glen

Benny4

17 Posts

0

December 14th, 2011 07:00

Glen,

Thanks for the reply. When I look at my LUN I see areas for concern given the following metrics.

Utilization = 70%

Throughput = 100

Bandwidth mbps = 10

Responsetime = 20

Avg Queue Length 12 or 2 per disk

However, when I look at these metrics at the disk level everything looks great. I looked at perfmon on the host and see response times of 4ms. However, when I look at the LUN I see 25-30ms. Then when I expand down to the individual disks I see between 2-4ms. So is the LUN calculating this as response time * number of disks?

This would appear to be a performance issue at the LUN level, but we are actually getting blazing fast performance from our host.

Thoughts

-Benny

kelleg

4.5K Posts

0

December 14th, 2011 09:00

The 70% Utilization figure is where engineering determined that you get the best performance and the best response times - higher utilization could increase response time.

To your specific question: why is LUN response time 25-30ms when the disks are at 2-4ms and the host is reporting 4ms?

The answer is probably caching. If you're doing a lot of Writes, each Write first gets sent to the Write cache, and once the Write is mirrored to the mirrored cache, the host gets an acknowledgement, so the host sees fast response times. Now that Write is sitting in cache and needs to get sent to the disks at some point. But if another Write comes in that covers the same data, you start the process over again: the Write goes into Write cache and is mirrored, the ACK gets sent to the host, and the Write is pending in cache waiting to be sent to disk. So the LUN is reporting higher response times (and there can be other reasons for this), while no data has yet been sent to the disks and the disks are reporting low response times.

It could be this or some other things that I've seen, but this is the most likely - caching has a huge impact on performance.

glen

kelleg

4.5K Posts

0

December 14th, 2011 10:00

As an added thought, those metrics from your class are the normal "Thresholds" that you can set in Analyzer for the Summary view. When these numbers are exceeded, the box representing each of these values will turn a different color, which makes it easier to see which LUNs are exceeding the limits (thresholds) you set. The IOPS = 100 is a good number, but you may want to use the numbers in the Best Practice Guides: 10K FC = 120 IOPS, 15K FC = 180 IOPS, SATA = 60 IOPS. You can leave the MB/s blank if your IO load is mostly IOPS.
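To make the per-disk numbers concrete, here is a minimal Python sketch of how a threshold for a whole RAID group or LUN could be scaled by the number of disks behind it. It uses only the per-disk IOPS figures and the "2 per disk" queue length quoted in this thread. The function names and disk-type keys are made up for illustration, the figures are treated as per-disk rules of thumb, and the sketch ignores RAID write penalty and cache effects, so take it as a rough guide rather than how Analyzer itself computes anything.

```python
# Per-disk IOPS rules of thumb quoted in the post above
# (keys are illustrative labels, not Analyzer identifiers).
PER_DISK_IOPS = {"FC_10K": 120, "FC_15K": 180, "SATA": 60}


def group_iops_threshold(disk_type: str, disk_count: int) -> int:
    """Rough IOPS threshold for a group of identical disks."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    return PER_DISK_IOPS[disk_type] * disk_count


def group_queue_threshold(disk_count: int, per_disk_queue: int = 2) -> int:
    """Rough average queue length threshold, scaled per disk."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    return per_disk_queue * disk_count


if __name__ == "__main__":
    # Example: a 5-disk group of 15K FC drives.
    print(group_iops_threshold("FC_15K", 5))
    print(group_queue_threshold(5))
```

With this scaling, the class figure of "12 or 2 per disk" corresponds to a 6-disk group.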

glen

SteveZhou

136 Posts

0

December 14th, 2011 23:00

Hi Glen,

As far as I know, LUN response time = cache response time. Imagine that when an I/O comes to the LUN (the cache, actually), the LUN will ACK the host once the I/O is mirrored to the peer cache. I have always thought it works this way, but today when I saw your input, it looks like I was wrong: LUN response time = the time it takes to go to the backend disk. But the request is from the host, not the disk, so why is LUN response time counted this way?

Your confirmation and clarification would be much appreciated.

Steve

kelleg

4.5K Posts

0

December 15th, 2011 09:00

It's actually more complex than that - LUN response time is also based on Queue Length - Response time and Queue Length are directly proportional - higher queuing = higher response time. My statement above is a simplification and more directed at the specific question asked about the difference between LUN IOPS and Disk IOPS and how this can be different.
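The proportionality between queue length and response time can be illustrated with Little's Law (average queue length = throughput x average response time). The Python sketch below is a general queueing relationship, not a description of how Analyzer derives its LUN response time, and the function name is made up for illustration.

```python
def response_time_ms(avg_queue_length: float, iops: float) -> float:
    """Average response time in ms implied by Little's Law.

    avg_queue_length: average number of outstanding requests
    iops: completed requests per second
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    # Little's Law: L = X * R, so R = L / X (seconds); convert to ms.
    return avg_queue_length * 1000.0 / iops


if __name__ == "__main__":
    # At a fixed 100 IOPS, doubling the queue doubles the response time.
    for queue in (1, 2, 3):
        print(queue, response_time_ms(queue, 100))
```

Under this relationship, at the same IOPS a LUN with a longer average queue will show a proportionally higher response time, even when each individual disk behind it is responding quickly.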

If you want, contact me off-line. You also might want to check with Jacob there about this.

glen

SteveZhou

136 Posts

0

December 15th, 2011 21:00

I will consult Jacob first and come to you if necessary. Thank you, Glen.

SKT2

1.3K Posts

0

January 3rd, 2012 17:00

Benny,

LUN response time NOT= (disk response time * number of disks)

SKT2

1.3K Posts

0

January 5th, 2012 17:00

Steve, what did you find out from Jacob? I see the thread is still unanswered.
