Understanding LUN Performance

Question

Could someone help me understand how to interpret LUN performance in Navi Analyzer? I have been doing some reading on past discussions but I'm looking for a clearer answer if there is one. My situation begins with a SQL 2005 DB cluster. There have been about three instances over the last few months where the cluster has failed over from the active node to the passive node. Each time, the storage performance comes into question, citing disk response time per the software used to monitor the cluster (Idera). According to the Analyzer data I don't see any major problems. The last time the failover occurred I did notice that the LUN containing the DB data (RAID 1/0, 4+4, dedicated) appeared to experience a high response time (64 ms) around the time of the failover. The LUN utilization and queue length (25% and 4.7 respectively) are rather low, at least as far as my knowledge is concerned. Forced flushes for that LUN are practically nothing during that same time period. The SPs' performance during that time period was as follows: Utilization 14% and 28%; Queue length 2.5; Resp time 1.75 ms; Dirty pages SPA 71%, SPB 89% (watermarks 80/60).

Now that I've said all that, I've read that LUN response time can be, or is likely to be, falsely reported during times when the utilization and total throughput for that LUN are low, and that you should only be concerned when all three of those metrics are high at the same time. Is this true? If so, is LUN response time a reliable metric to monitor? Some of these metrics seem to be a challenge to understand, at least for me.

I apologize for the lengthy explanation but I wanted to be sure that whoever took this one on had the necessary details. Any help will be appreciated.

Thanks,

Tim

driskollt1 · Accepted Answer

LUN response time is usually my starting point for troubleshooting. It gives a good indication of what the host is experiencing. If your LUN response time (response time = queue length * service time) is good, then the host usually isn't seeing issues. High response times don't always mean that the CLARiiON is busy. They can also indicate that you're having issues with your host or fabric.
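As a rough illustration of the relation quoted above, here is a minimal Python sketch that rearranges it to estimate service time from the figures in the question (64 ms response time, queue length 4.7). The function name is made up for this example, and it assumes both numbers describe the same polling interval, which Analyzer averages may not guarantee.

```python
def service_time_ms(response_time_ms: float, queue_length: float) -> float:
    """Rearrange: response time = queue length * service time."""
    if queue_length <= 0:
        raise ValueError("queue length must be positive")
    return response_time_ms / queue_length


# Figures from the question: 64 ms response time, queue length 4.7.
estimate = service_time_ms(64.0, 4.7)
print(f"Estimated service time: {estimate:.1f} ms")
```

This is only a back-of-the-envelope estimate, not something Analyzer reports in this form.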

LUN Utilization % is not a good metric to use - it doesn't report accurately. It tends to say utilization is higher than it actually is. It can also be deceptive, as it may say a LUN is 100% utilized when another LUN is causing your problem.

64 ms is fairly high, but I don't think a 64 ms spike would cause a failover. That might just be an indicator of the host having issues. The host could have been having issues and responding slowly before it finally died and initiated a failover. That would also drive LUN response time up (a write isn't a write until the host acknowledges it). LUN response time would also probably go up during the failover process.

At least half of your job managing storage is proving that the array/SAN is not the problem. Provide graphs to people who don't believe you. If you've laid everything out properly, then most likely it's a host issue (bad code, OS problems, HBA drivers, bad HBA, etc.).

Usually if I see high response times on a LUN, I check these things on the array:

How busy is the RAID Group/Disks
- Check each disk in the RAID Group - If total throughput sustains around 180 IOPS for 15k FC, then you need more disks. (In a perfect world, sustained throughput should really be less than 180 IOPS to leave room for spikes. I start to get concerned when I see sustained throughput over 120-130 IOPS for long periods of time.)
- Check RAID Group Utilization % - Not a terribly good indicator, but gives a decent ball park figure
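The per-disk rule of thumb above could be sketched in Python like this. The 180 and 120 IOPS thresholds come from the answer and apply to 15k FC disks only; the function name, the labels, and the disk names and sample values are hypothetical and not taken from any Analyzer export.

```python
CEILING_IOPS = 180  # rough sustained ceiling for a 15k FC disk, per the answer
CONCERN_IOPS = 120  # lower end of the 120-130 range where concern starts


def classify_disk(sustained_iops: float) -> str:
    """Label a disk by its sustained total throughput (IOPS)."""
    if sustained_iops >= CEILING_IOPS:
        return "saturated"
    if sustained_iops >= CONCERN_IOPS:
        return "watch"
    return "ok"


# Hypothetical sustained per-disk throughput for one RAID Group.
disks = {"disk_a": 95, "disk_b": 128, "disk_c": 185}
for name, iops in disks.items():
    print(name, iops, classify_disk(iops))
```

The thresholds only make sense for sustained values, so short spikes should be averaged out before applying them.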

What else is happening with the LUNs in the same RAID Group
- Check each LUN to see what it is doing. If you have a busy LUN, move it to another RAID Group.
- Check each LUN for Read/Write Size - If you have a LUN doing large reads or writes, possibly move it to another RAID Group. Large reads usually mean sequential I/O, and sequential reads on the same disks can hurt your random read performance on other LUNs.
- Check average busy queue length. This is a good indicator of how "bursty" a LUN is.

Check the SPs - Check % Dirty Pages - make sure it isn't at 99%
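A minimal Python sketch of the dirty-page check, using the SP figures and the 80/60 watermarks from the question. The 99% limit is the one named above; the function name and the three labels are assumptions made for this illustration, not Analyzer terminology.

```python
def dirty_page_status(dirty_pct: float, high_watermark: float = 80.0,
                      full_pct: float = 99.0) -> str:
    """Classify an SP's % dirty pages against the high watermark."""
    if dirty_pct >= full_pct:
        return "write cache effectively full"
    if dirty_pct > high_watermark:
        return "above high watermark"
    return "within watermarks"


# Figures from the question: SPA 71%, SPB 89%, watermarks 80/60.
for sp, pct in {"SPA": 71.0, "SPB": 89.0}.items():
    print(sp, pct, dirty_page_status(pct))
```

With the question's numbers, SPB sits above the high watermark but well short of 99%.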

Other stuff - Sometimes I select every single LUN and look at response times, Read/Write/Total bandwidth, Read/Write/Total Throughput, etc. (Not very pretty, but it helps me narrow down what's really busy. You can mouse over data to see what LUN you're looking at.) A poorly placed busy LUN can affect your entire array.

Once I go through the CLARiiON, I'll start looking at switch statistics for buffer credit issues. I'll also look at the host for HBA/driver/cable issues.

Parks2 · Answer

I do not know the answer to your question, but maybe you could talk to your salesperson, who might have someone to help you. EMC does offer classes as well.

kelleg · Answer

Tim,

In the online Help there is a lot of good information about using Analyzer, and there is a statement about the accuracy of certain values (response time being the major one) when IOPS and queue length are both low. As I/O increases over about 100 IOPS, response times get more accurate. But throughput under about 100 IOPS combined with a low queue length (1-3) usually means that response times will also be low.
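One way to apply that guidance when scanning Analyzer data is to flag samples where the reported response time deserves less trust. This is a minimal Python sketch: the cutoffs of about 100 IOPS and a queue length of 3 are taken from the reply above, while the function name and the sample values are hypothetical.

```python
def response_time_is_suspect(iops: float, queue_length: float,
                             min_iops: float = 100.0,
                             low_queue: float = 3.0) -> bool:
    """True when both IOPS and queue length are low, so the reported
    response time may be inaccurate."""
    return iops < min_iops and queue_length <= low_queue


# Hypothetical samples: (total IOPS, queue length, reported response time in ms).
samples = [(40, 1.2, 55.0), (450, 4.7, 64.0), (90, 6.0, 30.0)]
for iops, qlen, rt in samples:
    note = "suspect" if response_time_is_suspect(iops, qlen) else "usable"
    print(f"{iops} IOPS, queue {qlen}, {rt} ms -> {note}")
```

A "suspect" flag here only means the value should be cross-checked against other metrics, not that it is necessarily wrong.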

The Best Practice Guides in the Documents section go into greater detail about each of these topics.

glen

tthomas1 · Answer

It was. I just haven’t made it back out there to mark it as answered. I appreciate your answer. That was what I was looking for. Thanks, Tim

kelleg · Answer

Was your question answered? If so, please remember to mark the question as answered and to award points to the person with the correct answer. glen

kelleg · Answer

Tim, don't forget to mark it answered. glen
