LUN Response Time on Storage Pool

Question

I am trying to wrap my head around something. My VNX knowledge is so-so, so please forgive me. I have a FAST-enabled three-tier pool (EFD, SAS, NL-SAS) on a VNX 5700. The pool is carved up into ten 2TB LUNs for our VMware environment. I have VNX Monitoring and Reporting v2.0 installed and configured to alert me on LUN response times. I am getting alerts that one, and sometimes two, of the LUNs are exceeding the alerting threshold of 20ms.

What I don't understand is: if the LUNs are striped across all disks in the pool equally, shouldn't I be getting alerts for all 10 LUNs? This may sound like a stupid question, but if someone could provide clarification I would appreciate it.

dynamox · Answer

I find M&R alerts dubious at times. If you look at the vSphere performance charts, what do you see for response times on those datastores? Do they correlate with much higher I/O on these datastores compared to the others?

rewalmilo · Answer

It can also be a matter of the performance footprint of these LUNs. Let's say 10% of the blocks on these LUNs are heavy hitters and have been migrated to the faster tiers, while 90% of the blocks are located on the lower tiers. Then sometimes a task or job is started that accesses all of the blocks equally (backups, DB reorgs, batch jobs, etc.), and you will see higher response times for these LUNs. There is no free lunch.
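To put rough numbers on the point above, here is a minimal Python sketch. The per-tier latencies and the I/O split in the "normal" case are made-up illustrative values, not measurements from a VNX; only the 10%/90% block placement comes from the example above.

```python
# Assumed average latency per tier in milliseconds (illustrative only).
TIER_LATENCY_MS = {"fast": 1.0, "slow": 12.0}

# Fraction of the LUN's blocks on each tier, as in the example above.
BLOCK_PLACEMENT = {"fast": 0.10, "slow": 0.90}


def expected_latency_ms(io_share_by_tier):
    """Weighted average latency, given the share of I/O that lands on each tier."""
    return sum(io_share_by_tier[tier] * TIER_LATENCY_MS[tier]
               for tier in TIER_LATENCY_MS)


# Normal workload (assumption): 90% of the I/O goes to the hot 10% of blocks.
normal = expected_latency_ms({"fast": 0.90, "slow": 0.10})

# Backup or batch job: every block is read equally, so the I/O split
# follows the block placement and most of it lands on the slow tier.
full_scan = expected_latency_ms(BLOCK_PLACEMENT)

print(f"normal: {normal:.1f} ms, full scan: {full_scan:.1f} ms")
```

With these assumed figures the average latency goes from roughly 2 ms to roughly 11 ms without anything being wrong with the array.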

sstover1 · Answer

I did verify that the NAR files show the same information, so I do not believe I am getting false information from the M&R tool.

dynamox · Answer

I would look at service times as well. If service times are low while response times are high, then you should also consider port utilization, HBA utilization, and queuing.

adhamakady · Answer

Hello,

Response Time is calculated using the I/O queue length divided by the throughput. So, as the throughput gets lower and approaches zero, the Response Time can show some very high values, which do not necessarily indicate a performance bottleneck.

This is documented in this Knowledge Base article: https://support.emc.com/kb/174073.
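As a rough illustration of that calculation, here is a small Python sketch. It only assumes the relationship described above (queue length divided by throughput); the function name and the sample numbers are hypothetical, and the array's exact internal formula may differ.

```python
def response_time_ms(avg_queue_length, throughput_iops):
    """Estimate response time in ms as queue length divided by throughput."""
    if throughput_iops <= 0:
        # No I/O completed in the interval, so the ratio is undefined.
        return float("inf")
    return avg_queue_length * 1000.0 / throughput_iops


# Busy LUN: deep queue, high throughput, and still a low response time.
busy = response_time_ms(avg_queue_length=4.0, throughput_iops=2000.0)

# Nearly idle LUN: tiny queue, but throughput close to zero inflates the ratio.
idle = response_time_ms(avg_queue_length=0.05, throughput_iops=1.0)

print(f"busy LUN: {busy:.1f} ms, nearly idle LUN: {idle:.1f} ms")
```

In this made-up example the nearly idle LUN reports about 50 ms while the busy one reports about 2 ms, which is why a high value on a LUN with almost no throughput should not be read as a bottleneck on its own.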

I would take the response time measurements on the storage array seriously in case there is significant throughput on the LUN being analyzed, or if the host side is experiencing this latency or showing it in a host side performance monitoring tool as well. For example, do you see high response times on these LUNs from vSphere as well?

To check for performance bottlenecks overall, look, for example, at how much throughput the storage pool disks are handling (and whether the IOPS per disk are under the threshold for each disk type), at Storage Processor write cache levels, and at Storage Processor utilization.
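The per-disk check could be sketched in Python as below. The IOPS ceilings are commonly quoted rule-of-thumb figures and are an assumption here, so substitute the values from the best-practices guide for your array. The disk IDs, sample values, and the `headroom` parameter are hypothetical; in practice the samples would come from exported NAR data.

```python
# Assumed rule-of-thumb IOPS ceilings per drive type (not authoritative).
ASSUMED_IOPS_LIMIT = {"EFD": 3500, "SAS_15K": 180, "SAS_10K": 150, "NL_SAS": 90}


def overloaded_disks(samples, limits=ASSUMED_IOPS_LIMIT, headroom=0.7):
    """Return (disk_id, disk_type, iops) for disks above headroom * limit."""
    flagged = []
    for disk_id, disk_type, iops in samples:
        ceiling = limits[disk_type] * headroom
        if iops > ceiling:
            flagged.append((disk_id, disk_type, iops))
    return flagged


# Hypothetical per-disk averages: (disk ID, drive type, observed IOPS).
samples = [
    ("0_0_4", "EFD", 1200.0),
    ("0_1_2", "SAS_15K", 165.0),
    ("1_0_7", "NL_SAS", 40.0),
]

flagged = overloaded_disks(samples)
for disk_id, disk_type, iops in flagged:
    print(f"{disk_id} ({disk_type}) is running at {iops:.0f} IOPS")
```

Here only the SAS disk is flagged, because 165 IOPS is above 70% of its assumed 180 IOPS ceiling.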

If you have any questions about that, please let me know.

GouthamG · Answer

Is the issue happening all the time or only at specific times? If it happens only at certain times, check which activity is causing the high I/O. If it is a continuous problem, then we have to check from the storage side. Let me know whether you still have this problem.

Joshua_Fawcett · Answer

Looks like you have some good answers already in this thread but I wanted to speak to the last question/statement in your original post:

What I don't understand is: If the Luns are striped across all disks in the pool equally shouldn't I be getting alerts for all 10 LUNs?

In storage pools those 10 LUNs may or may not be striped across all disks in the pool. It's almost certain that they are not striped equally. So those 10 LUNs are probably all striped across different disks and hitting different disks at different times and in different places. The array will try to spread this load out across most or all of the drives, and it will move things around within the pool during the autotiering window if certain sets of disks get overly busy, but it will not be a perfect stripe across all drives. That is why some of those 10 LUNs may show high response times while others do not see the same contention.

If you want that data striped perfectly equally across all of the disks, you would need to configure a bunch of traditional RAID Groups and create metaLUNs. But, of course, with this method you could not use multiple types of disks and make use of the FAST VP autotiering benefits.

Anonymous User · Answer

Hi Joshua, couldn't it be that a few LUNs within the pool are getting more I/O than the others? That could have caused the high response times. Thanks, Rakesh

Joshua_Fawcett · Answer

Rakesh,

I assume that you meant to ask if it is possible that more I/O could be going to a few disks within the pool than others?

If that is the case then, yes, you could absolutely see more I/O go to a subset of the disks within the pool. It's possible that one or two LUNs have a majority of their slices on a smaller subset of disks within the pool and those one or two LUNs get extremely busy all of a sudden. In this case, this would create a "hot spot" within the pool and those drives would get saturated. High response times could occur because of this.
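A toy model of that hot-spot scenario, written in Python. Everything in it is hypothetical: the slice layout, the LUN and RAID group names, the IOPS figures, and the simplification that a LUN's I/O is spread evenly over its slices. It is only meant to show how an uneven layout lets one LUN suffer while its neighbours in the same pool do not.

```python
from collections import defaultdict

# Hypothetical layout: LUN -> {private RAID group: number of slices}.
slice_map = {
    "LUN_1": {"RG_A": 8, "RG_B": 1, "RG_C": 1},
    "LUN_2": {"RG_A": 3, "RG_B": 4, "RG_C": 3},
    "LUN_3": {"RG_B": 5, "RG_C": 5},
}
lun_iops = {"LUN_1": 2000.0, "LUN_2": 300.0, "LUN_3": 300.0}
rg_capacity_iops = {"RG_A": 900.0, "RG_B": 900.0, "RG_C": 900.0}


def rg_load(slice_map, lun_iops):
    """Spread each LUN's IOPS over RAID groups in proportion to slice count."""
    load = defaultdict(float)
    for lun, slices in slice_map.items():
        total = sum(slices.values())
        for rg, count in slices.items():
            load[rg] += lun_iops[lun] * count / total
    return dict(load)


def affected_luns(slice_map, load, capacity, min_share=0.5):
    """Find saturated RAID groups and the LUNs with most slices on them."""
    hot = {rg for rg, iops in load.items() if iops > capacity[rg]}
    affected = []
    for lun, slices in slice_map.items():
        total = sum(slices.values())
        share = sum(c for rg, c in slices.items() if rg in hot) / total
        if share >= min_share:
            affected.append(lun)
    return hot, affected


load = rg_load(slice_map, lun_iops)
hot, affected = affected_luns(slice_map, load, rg_capacity_iops)
print("load per RAID group:", load)
print("saturated:", sorted(hot), "affected LUNs:", affected)
```

In this made-up layout only RG_A is saturated and only LUN_1 has most of its slices there, so LUN_1 would be the one showing high response times.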

If the increase in I/O on those couple of LUNs continues long enough that the FAST algorithms rank some of those internal slices as busier, then when the next relocation window arrives some of those slices will be promoted to higher tiers or moved to other, less busy subsets of disks within the pool, so as to balance and fit the new load.

FAST is a really good technology and the algorithms work pretty well. But the data is not laid out as perfectly as it would be if you used metaLUNs on equally carved-out traditional RAID groups. If you want better control over where the I/O lands, metaLUNs are great. But there are definitely benefits to be gained in using pools and FAST as well.

Hope that helps.
