SAN_Lee
1 Nickel

Clariion Performance Metrics to Focus On

Hello All,

I'm working to build an initial performance review for some CX3-80 arrays.  I've gathered Navisphere Analyzer data for a week and I'm now processing it.  I'd like to get some feedback on which performance areas some of you use when building baselines.  These are the areas I'm pulling so far:

For each LUN / RAID Group / Disk / SP:

Utilization

Response Time

Total Throughput

At first these seemed like the main areas to focus on, but after reading more on the characteristic definitions, I'm wondering if it's also a good idea to include queue length with response time.

From my initial research and chats with peers, I came to believe that total I/O was a better metric to collect than bandwidth, but looking back now, I'm not sure that's always correct.  I see value in knowing how many reads and writes are being serviced in addition to the total I/O, but my fear is that trying to include all of these areas will make any performance report too bloated to be practical for anyone to use.

I also realize that response time values will vary depending on the type of data, application, host, etc., so I'm not sure how much real value there is in including it.
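For what it's worth, here's roughly how I'm rolling the week of data up at the moment.  It's only a sketch - the file name and column names are assumptions from my own CSV export, so they'd need to be adjusted to match whatever your Analyzer dump actually contains:

```python
import pandas as pd

# Rough sketch of how I'm summarizing a week of Navisphere Analyzer data.
# Assumes the .nar data was exported to CSV with one row per object per poll
# interval; the file name and column names below are assumptions from my export.
df = pd.read_csv("cx3-80_week.csv", parse_dates=["Poll Time"])

luns = df[df["Object Type"] == "LUN"]

baseline = luns.groupby("Object Name").agg(
    avg_response_ms=("Response Time (ms)", "mean"),
    p95_response_ms=("Response Time (ms)", lambda s: s.quantile(0.95)),
    avg_util_pct=("Utilization (%)", "mean"),
    avg_total_iops=("Total Throughput (IO/s)", "mean"),
    peak_total_iops=("Total Throughput (IO/s)", "max"),
)

# Worst offenders by 95th-percentile response time
print(baseline.sort_values("p95_response_ms", ascending=False).head(20))
```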

If some of you are willing, I'd very much like to see what performance areas you think are prudent to include in a general health overview.

Also, I'd be interested to hear about any supplemental software packages you may be using to monitor performance.

Thanks for reading,

Lee

1 Solution

Accepted Solutions
8 Krypton

Re: Clariion Performance Metrics to Focus On

Response time = Service time * Queue Length.  So usually high queue lengths will show up in response time.

Here's what I typically look at for the basics.  I often drill down into more of the metrics, but these usually uncover the issues that tell me which other metrics I need to look at.

LUN Metrics

Response time

Read/Write Throughput (I/O)

Read/Write Size - Usually large sizes indicate sequential operations (e.g. backups, full table scans, etc.)

Bandwidth is just I/O * I/O size, so the metric itself isn't too useful; it mostly just reveals the type of I/O (see the quick arithmetic sketch after this list).

Average Busy Queue Length - This can reveal host issues (if the HBA queue depth is set too high, you'll see this number get very high).  It also reveals how "bursty" an application is.

Forced Flushes/s is also good to check if a particular LUN is having problems.
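To put some quick numbers on the two relationships above (response time vs. queue length, and bandwidth vs. I/O size), here's the back-of-the-envelope arithmetic - the figures are made up for illustration, not Analyzer output:

```python
# Back-of-the-envelope versions of the two relationships above,
# using made-up example numbers (not real Analyzer data).

service_time_ms = 5.0      # avg time spent servicing one I/O
queue_length = 4.0         # avg number of I/Os queued + in service
response_time_ms = service_time_ms * queue_length
print(f"Response time ~ {response_time_ms:.0f} ms")    # 20 ms

iops = 2000                # total throughput (I/O per second)
io_size_kb = 8             # avg I/O size
bandwidth_mb_s = iops * io_size_kb / 1024
print(f"Bandwidth ~ {bandwidth_mb_s:.1f} MB/s")        # ~15.6 MB/s
```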

Disk/RAID Group Metrics

Total Throughput (Disk) - If the total throughput of an individual disk is over around 160 IOPS for a 15k drive (180 is the rule of thumb, but very large I/O will usually max out around 160 IOPS), it might be time to spread the load across more spindles (rough check sketched after this list).

Compare the read/write size at the disk/RAID group to the LUN's read/write size - Larger reads/writes to disk than to the LUN are usually good - it means the cache is doing what it's supposed to do.
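As a rough illustration of the spindle rule of thumb above - the RAID group names and IOPS figures here are just examples, and the 160/180 thresholds are the rules of thumb mentioned, not hard limits:

```python
# Sketch: flag RAID groups whose per-spindle load is past the 15k rule of thumb.
# Example numbers only; thresholds are the rules of thumb discussed above.

RULE_OF_THUMB_IOPS_15K = 180   # small/random I/O
LARGE_IO_IOPS_15K = 160        # large I/O tends to max out lower

raid_groups = {
    # name: (total disk IOPS across the RG, number of spindles)
    "RG 0": (1400, 5),
    "RG 1": (900, 9),
}

for name, (total_iops, spindles) in raid_groups.items():
    per_disk = total_iops / spindles
    if per_disk > LARGE_IO_IOPS_15K:
        print(f"{name}: {per_disk:.0f} IOPS/disk - consider more spindles")
    else:
        print(f"{name}: {per_disk:.0f} IOPS/disk - OK")
```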

SP Metrics

% Dirty Pages

% Utilization

Check for Queue Full on SP ports

Check bandwidth on SP ports

14 Replies
SAN_Lee
1 Nickel

Re: Clariion Performance Metrics to Focus On

Thank you for that feedback - that's exactly the sort of thing I'm looking for as I work to build my report models.

8 Krypton

Re: Clariion Performance Metrics to Focus On

Please remember to mark the question as answered and to award points to the person that best answers your question.

glen

8 Krypton

Re: Clariion Performance Metrics to Focus On

I also like monitoring Disk Crossings, in addition to sticking to proper alignment at the OS level, the recommended block size for the application, etc., to get maximum benefit (see the sketch below).
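To illustrate the disk-crossing point: an I/O that straddles a stripe element boundary has to touch two drives, which is why alignment matters.  A minimal sketch, assuming the common 64 KB stripe element size - use your RAID group's actual element size:

```python
# Sketch: does an I/O cross a stripe element boundary (a "disk crossing")?
# 64 KB is the common CLARiiON stripe element size; use your RG's actual value.

ELEMENT_SIZE = 64 * 1024

def crosses_element(offset_bytes: int, size_bytes: int) -> bool:
    """True if the I/O spans more than one stripe element (touches 2+ disks)."""
    first = offset_bytes // ELEMENT_SIZE
    last = (offset_bytes + size_bytes - 1) // ELEMENT_SIZE
    return first != last

# A misaligned partition (e.g. data starting at the old 63-sector offset of
# 32,256 bytes) makes many 64 KB I/Os straddle an element boundary:
print(crosses_element(32_256, 65_536))   # True  - misaligned, crosses
print(crosses_element(65_536, 65_536))   # False - aligned, single element
```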

8 Krypton

Re: Clariion Performance Metrics to Focus On

Does a high value for ABQL really matter?  I have an application whose LUNs report service time as well as response time of less than 10 ms (queue length less than one), yet the ABQL averages above fifteen and there are no performance issues.  So what is the performance significance of ABQL, other than being an indication of "bursty" I/O?

8 Krypton

Re: Clariion Performance Metrics to Focus On

How do you correlate the read/write size to performance?  Is it just "the more sequential, the better", or do you use other logic?  And the other way around: if the I/Os are not sequential, what do you do?

8 Krypton

Re: Clariion Performance Metrics to Focus On

For ABQL - see Primus emc204523 "What is the cause of high Queuing on CLARiiON drives?"

Also, for questions about Analyzer see Primus emc218359 - this is a list of most of the Primus solutions

glen

8 Krypton

Re: Clariion Performance Metrics to Focus On

Large I/O sizes - greater than 64 KB - could indicate a backup operation; backups by their nature are generally sequential. The more sequential the I/O, the better the cache works.

If you need to determine whether the I/O is sequential or random, there are a number of ways to do it:

1. If the I/O is 100% reads - look at the Read Cache Hit Ratio - the closer it gets to 1, the more sequential the I/O. The more sequential it is, the better pre-fetching works.

2. If the I/O is 100% writes - look at Full Stripe Writes - if the number of Full Stripe Writes is about the same as the Total Write IOPS, then the data is very sequential and the cache will work better - you can eliminate the extra parity calculations when using RAID 5. (There's a rough sketch of both checks below.)
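Here's a rough sketch of those two checks, assuming you've already pulled the per-LUN counters out of Analyzer; the numbers and the 0.9 cut-offs are my own assumptions, not anything from the Best Practice docs:

```python
# Sketch of the two checks above, using made-up per-LUN Analyzer counters.
# The 0.9 cut-offs are arbitrary assumptions - tune them to your environment.

def looks_sequential_reads(read_cache_hit_ratio: float) -> bool:
    # 100% read workload: a hit ratio close to 1 suggests pre-fetch is
    # keeping up, i.e. the reads are largely sequential.
    return read_cache_hit_ratio >= 0.9

def looks_sequential_writes(full_stripe_writes_s: float, write_iops: float) -> bool:
    # 100% write workload: full-stripe writes close to total write IOPS
    # suggests sequential writes (no extra RAID 5 parity read-modify-write).
    return write_iops > 0 and full_stripe_writes_s / write_iops >= 0.9

print(looks_sequential_reads(0.97))        # True  - mostly sequential reads
print(looks_sequential_writes(180, 200))   # True  - mostly full-stripe writes
print(looks_sequential_writes(15, 200))    # False - mostly random writes
```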

Most of this is explained in the two Best Practice documents:

EMC CLARiiON Performance and Availability Release 30 Firmware Update Applied Best Practices.pdf

EMC CLARiiON Storage System Fundamentals for Performance and Availability

glen

8 Krypton

Re: Clariion Performance Metrics to Focus On

ABQL is a counter that can be useful: if the ABQL is high, response time will climb as I/O waits to be serviced. If there are many hosts attached to the array, it would be worth checking the HBA queue depth setting; 32 is a good starting point.

If the high ABQL is partnered with Queue Full conditions on the SP ports, then adjusting the HBA queue depth will certainly help the issue (rough sanity check below).
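As a quick sanity check on that, you can estimate the worst-case outstanding I/O the hosts could push at a single front-end port.  This is only a sketch - the port queue limit below is a placeholder, so look up the real value for your array and FLARE revision:

```python
# Sketch: estimate worst-case outstanding I/Os aimed at one SP front-end port.
# PORT_QUEUE_LIMIT is a placeholder, not a confirmed CX3 value - check your docs.

PORT_QUEUE_LIMIT = 1600          # placeholder value

hosts_on_port = 20
luns_per_host = 4
hba_queue_depth = 32             # per-LUN queue depth on each host HBA

worst_case = hosts_on_port * luns_per_host * hba_queue_depth
print(f"Worst-case outstanding I/Os: {worst_case}")    # 2560

if worst_case > PORT_QUEUE_LIMIT:
    print("Potential for Queue Full - consider lowering the HBA queue depth")
```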
