LUN response time is usually my starting point for troubleshooting. It gives a good indication of what the host is experiencing. If your LUN response time (response time = queue length * service time) is good, then the host usually isn't seeing issues. High response times don't always mean that the CLARiiON is busy. They can also indicate that you're having issues with your host or fabric.
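To make the formula above concrete, here is a minimal Python sketch of the arithmetic. The function name and the sample values are made up for illustration and are not taken from a real array; it only applies the relationship exactly as quoted in this post.

```python
def response_time_ms(queue_length: float, service_time_ms: float) -> float:
    """Response time as quoted above: queue length * service time."""
    return queue_length * service_time_ms


# Hypothetical sample values, not measurements from a real CLARiiON.
samples = [
    ("quiet LUN", 1.0, 5.0),
    ("bursty LUN", 8.0, 8.0),
]

for name, queue_length, service_time in samples:
    rt = response_time_ms(queue_length, service_time)
    print(f"{name}: {rt:.1f} ms")
```

With these numbers, a queue of 8 at 8 ms service time works out to 64 ms, so a deep queue alone can produce a high response time even when each individual I/O is serviced reasonably quickly.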
LUN Utilization % is not a good metric to use - it doesn't report accurately. It tends to say utilization is higher than it actually is. It can also be deceptive, as it may say a LUN is 100% utilized when another LUN is causing your problem.
64 ms is fairly high, but I don't think a 64 ms spike would cause a failover. That might just be an indicator of the host having issues. The host could have been having issues and responding slowly before it finally died and initiated a failover. That would also drive LUN response time up (a write isn't a write until the host acknowledges it). LUN response time would also probably go up during the failover process.
At least half of your job managing storage is proving that the array/SAN is not the problem. Provide graphs to people who don't believe you. If you've laid everything out properly, then most likely it's a host issue (bad code, OS problems, HBA drivers, bad HBA, etc.).
Usually if I see high response times on a LUN, I check these things on the array:
How busy is the RAID Group/Disks?
Check each disk in the RAID Group - If total throughput sustains around 180 IOPS for 15k FC, then you need more disks. (In a perfect world, sustained throughput should really stay below 180 IOPS to leave room for spikes. I start to get concerned when I see sustained throughput over 120-130 IOPS for long periods of time.)
Check RAID Group Utilization % - Not a terribly good indicator, but it gives a decent ballpark figure.
What else is happening with the LUNs in the same RAID Group?
Check each LUN to see what it is doing. If you have a busy LUN, then move it to another RAID Group.
Check each LUN for Read/Write Size - If you have a LUN doing large reads or writes, possibly move it to another RAID Group. Large reads usually mean sequential I/O, and sequential reads on the same disks can hurt your random read performance on other LUNs.
Check average busy queue length. This is a good indicator of how "bursty" a LUN is.
Check the SPs - Check % Dirty Pages and make sure it isn't at 99%.
Other stuff - Sometimes I select every single LUN and look at response times, Read/Write/Total bandwidth, Read/Write/Total Throughput, etc. (Not very pretty, but it helps me narrow down what's really busy. You can mouse over data to see what LUN you're looking at.) A poorly placed busy LUN can affect your entire array.
Once I go through the CLARiiON, I'll start looking at switch statistics for buffer credit issues. I'll also look at the host for HBA/driver/cable issues.
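The per-disk check in the list above can be sketched in a few lines of Python. This is only a rough illustration: the 120 and 180 IOPS thresholds come from the figures quoted above for 15k FC disks, while the disk names, the sample data, and the choice to treat "sustained" as the plain average of the samples are all assumptions made for the example. It does not read data from Analyzer; you would have to supply exported per-disk IOPS yourself.

```python
from statistics import mean
from typing import Sequence

# Thresholds from the post above, for 15k FC disks only.
CONCERN_IOPS = 120  # sustained level where the author starts to get concerned
CEILING_IOPS = 180  # sustained level where more disks are needed


def classify_disk(iops_samples: Sequence[float]) -> str:
    """Classify one disk from per-interval IOPS samples.

    Assumption: "sustained" is approximated by the mean of the samples.
    """
    sustained = mean(iops_samples)
    if sustained >= CEILING_IOPS:
        return "needs more disks"
    if sustained >= CONCERN_IOPS:
        return "watch"
    return "ok"


# Hypothetical samples; the disk names are illustrative only.
raid_group = {
    "disk_0": [90, 110, 95, 100],
    "disk_1": [125, 140, 130, 135],
    "disk_2": [185, 190, 178, 182],
}

for disk, samples in raid_group.items():
    print(disk, classify_disk(samples))
```

Averaging hides short spikes, which is intentional here since the advice is about sustained load. Other drive types would need their own thresholds.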
driskollt1
131 Posts
0
September 28th, 2010 08:00
Parks2
116 Posts
0
September 13th, 2010 18:00
I do not know the answer to your question, but maybe you could talk to your sales person, who might have someone to help you. EMC does offer classes as well.
kelleg
4 Operator
•
4.5K Posts
1
September 20th, 2010 15:00
Tim,
In the online Help there is a lot of good information about using Analyzer, and there is a statement about the accuracy of certain values (response time being the major one) when IOPS and queue length are both low. As I/Os increase over about 100 IOPS, response times get more accurate. But IOPS under about 100 and a low queue length (1-3) usually mean that response times will also be low.
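As a rough illustration of that rule of thumb, here is a small Python sketch that flags samples in the low-IOPS, low-queue region, where the reported response time is less trustworthy. The function name and sample values are hypothetical, and the cutoffs (100 IOPS, queue length of 3) are just the approximate figures mentioned above, not exact values from the Analyzer Help.

```python
# Approximate cutoffs from the paragraph above; treat them as rough guides.
LOW_IOPS = 100
LOW_QUEUE_LENGTH = 3


def response_time_is_reliable(iops: float, queue_length: float) -> bool:
    """Return False when both IOPS and queue length are low."""
    low_iops = iops < LOW_IOPS
    low_queue = queue_length <= LOW_QUEUE_LENGTH
    return not (low_iops and low_queue)


# Hypothetical (iops, queue_length) samples for illustration.
for iops, queue_length in [(50, 2), (400, 2), (50, 10)]:
    ok = response_time_is_reliable(iops, queue_length)
    print(f"iops={iops} queue={queue_length} reliable={ok}")
```

A filter like this is only meant to decide which response time samples deserve a second look before drawing conclusions from them.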
The Best Practice Guides in the Documents section go into greater detail about each of these topics.
glen
tthomas1
10 Posts
0
October 5th, 2010 09:00
It was. I just haven’t made it back out there to mark it as answered. I appreciate your answer. That was what I was looking for.
Thanks,
Tim
kelleg
4 Operator
•
4.5K Posts
0
October 5th, 2010 09:00
Was your question answered? If so, please remember to mark the question as answered and to award points to the person with the correct answer.
glen
kelleg
4 Operator
•
4.5K Posts
0
October 5th, 2010 11:00
Tim,
Don't forget to mark it answered.
glen