December 1st, 2010 10:00

Clariion Performance Metrics to Focus On

Hello All,

I'm working to build an initial performance review for some CX3-80 arrays.  I've gathered Navisphere Analyzer data for a week and I'm now processing it.  I'd like to get some feedback on which performance areas some of you use when building baselines.  These are the areas I'm pulling so far:

LUN/RAID Group/Disk/SP

Utilization

Response Time

Total Throughput

At first these seemed like the main areas to focus on, but after reading more on the characteristic definitions, I'm wondering if it's also a good idea to include queue length with response time.

From my initial research and chats with peers, I came to believe that total I/O was a better metric to collect than bandwidth, but looking back now, I'm not sure that's always correct.  I see value in knowing how many reads and writes are being serviced in addition to the total I/O, but my fear is that trying to include all of these areas will make any performance report too bloated to be practical for anyone to use.

I also realize that response time values will vary depending on the type of data, application, host, etc., so I'm not sure how much real value including it would add.

If some of you are willing, I'd very much like to see what performance areas you think are prudent to include in a general health overview.

Also, I'd be interested to hear about any supplemental software packages you may be using to monitor performance.

Thanks for reading,

Lee

131 Posts

December 1st, 2010 11:00

Response time = Service time * Queue Length.  So usually high queue lengths will show up in response time.
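A quick worked example of that relationship, with made-up numbers (a sketch, not Analyzer output):

    # Response time as seen by the host, from the relationship above.
    service_time_ms = 5.0    # time the array needs to service one I/O
    queue_length = 4.0       # average outstanding I/Os (ABQL)
    response_time_ms = service_time_ms * queue_length
    print(response_time_ms)  # 20.0 ms -- high queueing shows up here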

Here is typically what I look at for the basics.  I drill down into more of the metrics often, but these usually uncover issues which tell me which other metrics I need to look at.

LUN Metrics

Response time

Read/Write Throughput (I/O)

Read/Write Size - Large sizes usually indicate sequential operations (e.g. backups, full table scans, etc.)

Bandwidth is just I/O * I/O size, so the metric itself isn't too useful; it mostly just reveals the type of I/O. The sketch after this list shows the derivation.

Average Busy Queue Length - This can reveal host issues (if the HBA queue depth is set too high, you'll see this number get really high).  It also reveals how "bursty" an application is.

Forced Flushes/s is also good to check if a particular LUN is having problems.
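If you export the NAR data to CSV, a first-pass filter along these lines can pull out the LUNs worth drilling into. The column names and thresholds here are assumptions for illustration; match them to whatever your Analyzer export actually produces:

    import csv

    # Sketch: flag busy LUNs from an Analyzer CSV export.
    with open("analyzer_lun_export.csv") as f:
        for row in csv.DictReader(f):
            iops = float(row["Total Throughput (IO/s)"])      # assumed column name
            io_size_kb = float(row["Average I/O Size (KB)"])  # assumed column name
            abql = float(row["Average Busy Queue Length"])
            resp_ms = float(row["Response Time (ms)"])
            # Bandwidth is derived, per the note above: IOPS * I/O size
            bandwidth_mb_s = iops * io_size_kb / 1024.0
            if abql > 10 or resp_ms > 20:   # example thresholds, tune to taste
                print(row["LUN Name"], iops, round(bandwidth_mb_s, 1), abql)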

Disk/RAID Group Metrics

Total Throughput (Disk) - If you see the total throughput of an individual disk over around 160 IOPS for a 15k drive (180 is the rule of thumb, but very large I/O will usually max out around 160 IOPS), it might be time to spread the load across more spindles.  A quick check is sketched after this list.

Compare the read/write size at the disk/RG to the LUN's read/write size - larger reads/writes to disk than to the LUN are usually good: it means the cache is doing what it's supposed to do.
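A minimal sketch of that per-disk check (disk names and IOPS values are made up for illustration):

    # Flag disks running past the comfortable range for a 15k FC drive.
    RULE_OF_THUMB_IOPS = 160  # conservative figure, assuming larger I/O sizes

    disk_stats = {"Bus 0 Enc 1 Disk 4": 175, "Bus 0 Enc 1 Disk 5": 92}
    for disk, iops in disk_stats.items():
        if iops > RULE_OF_THUMB_IOPS:
            print(disk, "- candidate for spreading across more spindles")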

SP Metrics

% Dirty Pages

% Utilization

Check for Queue Full on SP ports

Check bandwidth on SP ports
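For % Dirty Pages, the thing to watch is where it sits relative to the write cache watermarks. The 60/80 values below are just common settings, not a given; check what your array is actually configured with:

    # Sketch: classify write cache pressure from % dirty pages.
    LOW_WM, HIGH_WM = 60, 80   # assumed low/high watermarks

    def cache_pressure(dirty_pct):
        if dirty_pct >= HIGH_WM:
            return "at/above high watermark -- expect forced flushes"
        if dirty_pct >= LOW_WM:
            return "normal flushing range"
        return "plenty of headroom"

    print(cache_pressure(85))  # at/above high watermark -- expect forced flushes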

5 Posts

December 2nd, 2010 08:00

Thank you for that feedback - that's exactly the sort of thing I'm looking for as I work to build my report models.

4.5K Posts

December 2nd, 2010 10:00

Please remember to mark the question as answered and to award points to the person that best answers your question.

glen

1.3K Posts

December 22nd, 2010 01:00

Also, I like monitoring disk crossings, and sticking strictly to the alignment recommendations at the OS level, the block size recommendations for the application, etc., to get maximum benefit.
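To illustrate why OS-level alignment matters for disk crossings: with the classic 63-sector (31.5 KB) partition offset, some I/Os that would fit inside one 64 KB stripe element straddle two, turning one disk I/O into two. A rough model (it assumes requests aligned to their own size within the partition):

    # Sketch: fraction of I/Os crossing a stripe element boundary.
    def crossing_fraction(io_size=4096, element=65536, offset=63 * 512):
        starts = range(0, element, io_size)
        crossings = sum(1 for s in starts
                        if ((offset + s) % element) + io_size > element)
        return crossings / len(starts)

    print(crossing_fraction())          # ~0.06: 1 in 16 I/Os is doubled
    print(crossing_fraction(offset=0))  # 0.0 once the partition is aligned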

1.3K Posts

December 22nd, 2010 05:00

How do you correlate read/write size to performance? Is it simply the more sequential the better, or do you have some other logic? And the other way around: if the I/Os are not sequential, what do you do?

1.3K Posts

December 22nd, 2010 05:00

Does a high value of ABQL really matter? I have an application whose LUNs report service time as well as response time under 10 ms (queue length less than one), but ABQL averages above fifteen, with no performance issues. So what is the performance significance of ABQL, other than indicating "bursty" I/O?

4.5K Posts

December 22nd, 2010 11:00

For ABQL - see Primus emc204523 "What is the cause of high Queuing on CLARiiON drives?"

Also, for questions about Analyzer see Primus emc218359 - this is a list of most of the Primus solutions

glen

4.5K Posts

December 22nd, 2010 11:00

Large I/O size - greater than 64KB - could indicate a backup operation; backups by their nature are generally sequential. The more sequential the I/O, the better the cache works.

If you need to determine whether I/O is sequential or random, there are a couple of ways:

1. If the I/O is 100% reads - look at the Read Cache Hit Ratio. The closer it gets to 1, the more sequential the I/O, and the more sequential it is, the better pre-fetching works.

2. If the I/O is 100% writes - look at Full Stripe Writes. If the number of Full Stripe Writes matches the Total Write IOPS, the data is very sequential and the cache will work well: you eliminate the extra parity calculations when using RAID 5.
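A minimal sketch of those two checks (the 0.9 cutoff is an arbitrary illustration, not an official threshold):

    # Sketch: classify a workload as sequential from Analyzer counters.
    def looks_sequential(read_hit_ratio=None, full_stripe_writes=None,
                         total_write_iops=None):
        if read_hit_ratio is not None:          # 100% read workload
            return read_hit_ratio > 0.9         # near 1 = prefetch is winning
        if full_stripe_writes is not None and total_write_iops:
            # Full stripe writes ~= total write IOPS means no
            # read-modify-write parity overhead on RAID 5.
            return full_stripe_writes / total_write_iops > 0.9
        return False

    print(looks_sequential(read_hit_ratio=0.97))                          # True
    print(looks_sequential(full_stripe_writes=50, total_write_iops=400))  # False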

Most of this is explained in the two Best Practice documents:

EMC CLARiiON Performance and Availability Release 30 Firmware Update Applied Best Practices.pdf

EMC CLARiiON Storage System Fundamentals for Performance and Availability

glen

51 Posts

December 22nd, 2010 13:00

ABQL is a counter that's useful when it runs high: response time will climb as I/O waits to be serviced. If there are many hosts attached to the array, it's worth checking the HBA queue depth setting; 32 is a good starting point.

If high ABQL is partnered with Queue Full conditions, then tuning the HBA queue depth will certainly help the issue.
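Back-of-the-envelope for why the queue depth setting matters at the port: the sum of every host's outstanding I/Os has to fit in the SP port's queue. The port limit below is purely an assumed figure for illustration; check the actual limit for your FLARE release:

    # Sketch: can the SP port absorb every host's queue depth at once?
    PORT_QUEUE_LIMIT = 1600   # assumed value, for illustration only

    hosts, luns_per_host, hba_queue_depth = 20, 4, 32
    worst_case = hosts * luns_per_host * hba_queue_depth
    print(worst_case)                     # 2560 outstanding I/Os possible
    print(worst_case > PORT_QUEUE_LIMIT)  # True -> QFULL risk if all burst at once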

5.7K Posts

December 23rd, 2010 00:00

- you can eliminate the extra parity calculations when using raid 5.

Well... not entirely true. There's still a write penalty, but it is indeed lower than with random I/O, or even with RAID 10. With RAID 10 the penalty is always 2 for writes, no matter if it's random or sequential. With RAID 5 the penalty is 4 for random I/O; for sequential I/O it drops to 1.25 in a 5-disk (4+1) RAID 5 RAID group and about 1.11 in a 10-disk (9+1) RG.
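The sequential numbers fall out of the full-stripe geometry: one parity write per stripe of data writes. A quick check of the figures above:

    # Sketch: back-end disk writes per host write, RAID 5 sequential case.
    def raid5_sequential_penalty(data_disks):
        return (data_disks + 1) / data_disks   # +1 parity write per full stripe

    print(raid5_sequential_penalty(4))   # 1.25  for a 5-disk (4+1) RG
    print(raid5_sequential_penalty(9))   # ~1.11 for a 10-disk (9+1) RG
    # Random small writes on RAID 5: read data + read parity + write data
    # + write parity = penalty of 4. RAID 1/0 is always 2.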

1.3K Posts

December 23rd, 2010 12:00

The dump method will only work when Navisphere Analyzer is licensed (for the NAR file), though.

1.3K Posts

January 8th, 2011 12:00

We consider a 15k rpm FC disk to be 180 IOPS for ideal calculations, but the actual number depends on the actual read or write sizes. So if my I/O (read and write) is always under 4K (for example, Exchange logs), how many IOPS can I expect from the same disk?

I remember people mentioning they have observed even 300 IOPS in some cases, but I'm not sure how small their I/O size was.

4.5K Posts

January 10th, 2011 13:00

It really varies based on I/O size, randomness, etc. The latest Best Practice guide for FLARE 30 has more detail on this in the sizing section near the end.
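One way to reason about it is a simplified service-time model (a sketch, not the official sizing tool): per-I/O time is roughly average seek plus half a rotation plus transfer time, so small I/Os barely pay the transfer cost.

    # Sketch: crude random-I/O IOPS model for a single 15k rpm drive.
    def disk_iops(io_kb, seek_ms=3.5, rpm=15000, media_kb_per_ms=100.0):
        rotational_ms = (60000.0 / rpm) / 2    # half a rotation = 2 ms at 15k
        transfer_ms = io_kb / media_kb_per_ms  # assumes ~100 MB/s media rate
        return 1000.0 / (seek_ms + rotational_ms + transfer_ms)

    print(round(disk_iops(4)))               # ~180: the rule of thumb
    print(round(disk_iops(64)))              # ~163: why large I/O tops out near 160
    print(round(disk_iops(4, seek_ms=1.0)))  # ~329: short seeks explain 300+ IOPS sightings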

glen

75 Posts

January 11th, 2011 10:00

There are a number of factors which affect disk IOPS, not all of which we can detail on a public forum. You can ask for a USPEED rep to run your numbers through a sizing tool, which will give you a very good estimate. But you will have to know your level of sequentiality, read/write mix (what % is reads, what % is writes) and predominant I/O size.

For example, the Exchange log is sequential I/O; the drives can absorb a lot of sequential small writes as they get bundled up in the cache and executed as a few large writes periodically. Uncached sequential is also fast: the drives spin pretty quickly, and as long as you are reading/writing in the same direction they can lay down data pretty fast :-)
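The arithmetic on that bundling effect is striking. Assuming, purely for illustration, that the cache coalesces 4 KB sequential writes into 256 KB back-end flushes:

    # Sketch: back-end cost of cached sequential small writes.
    host_write_kb, flush_kb, host_iops = 4, 256, 2000
    backend_iops = host_iops * host_write_kb / flush_kb
    print(backend_iops)   # 31.25 disk writes/s to absorb 2000 host writes/s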
