111 Posts

January 20th, 2008 17:00

I am curious about that myself...

This is how I look at it, but I am not sure if I am correct.

Like all metrics, the system is sampling.
For queue length, suppose I sample 5 times and get:
1: 3 I/O
2: 0 I/O
3: 2 I/O
4: 5 I/O
5: 0 I/O

The average queue length is 10/5 = Queue Length 2

Same metrics, but for Busy Queue Length you throw out samples 2 and 5 (the idle ones).

The busy queue length is 10/3 = Busy Queue Length 3.33
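The arithmetic above can be sketched in a few lines of Python (the sample values are the hypothetical ones from this example, not real Analyzer output):

```python
# Hypothetical queue depths observed at 5 sample points
samples = [3, 0, 2, 5, 0]

# Average queue length: every sample counts, idle or not
avg_queue = sum(samples) / len(samples)      # 10 / 5 = 2.0

# Average busy queue length: idle (zero) samples are thrown out
busy = [s for s in samples if s > 0]
avg_busy_queue = sum(busy) / len(busy)       # 10 / 3 ≈ 3.33
```

The same total work (10 I/Os) spread over fewer busy samples is what pushes the busy figure above the plain average, which is why a large gap between the two suggests bursts.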

Average Busy Queue Length
Average number of requests waiting at a busy system component to be serviced, including the request that is currently in service. Since the queue length is counted only when the component is busy, the value indicates the frequency variation (burst frequency) of incoming requests. The higher the value, the bigger the burst, and the longer the average response time at this component

10 Posts

January 20th, 2008 21:00

Sounds good to me. You couldn't have explained it any better.

38 Posts

January 22nd, 2008 01:00

There are a couple of things to consider when looking at the queue reported within Navisphere Analyzer. The general idea is that if the average busy queue (AvByQ) is greater than the average queue, then we have bursty activity; well, yes and no.

Let me explain.
Typically, with steady read and write activity to a LUN, the values of Queue and AvByQ will be relatively close, which is a good indication we have steady activity. Where that philosophy shifts somewhat is when you have a higher level of write activity and the array is under a reasonable load, one where high watermark flushing is being used.
Writes will typically go into cache; for optimization, the array wants to keep as much data in write cache as possible so it can coalesce it into larger I/Os or even full stripe writes, reducing backend activity at the disks. In doing this, it will periodically write data out based on flushing algorithms. Where this gets busy is in high watermark flushing, the point at which higher priority processes are started to flush dirty pages from cache to disk more aggressively, bringing the dirty page level of cache back down to the low watermark setting.

During flushing there may be in excess of 10 I/Os at the disk, sent in bursts from cache; thus the actual disk activity will appear bursty, whereas the higher level LUN activity may not be bursty at all, but relatively steady.

If the LUN AvByQ is much higher than the average queue, this is normally an indication the host is bursty, which could be limiting your IOPS capability and will certainly be impacting I/O response time. Since the average response time reported by Analyzer is for the sample period, typically 2 minutes, you should check host reporting mechanisms (Perfmon, iostat, sar) to determine what the response time actually is.

The desired values for Queue and AvByQ are relative, i.e. if I have a queue of 4 on a component and that component's service time is 8ms, my I/O will take 4*8 = 32ms. If I am happy with that, fine. If I cannot accept a 32ms response time, I need to keep the queue down, i.e. less busy, fewer IOPS.
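That queue-times-service-time estimate is simple enough to express directly (a back-of-envelope Python sketch; the function name and figures are illustrative, not an Analyzer formula):

```python
def estimated_response_ms(queue_length, service_time_ms):
    """Rough response-time estimate: each I/O waits behind
    the I/Os queued ahead of it, each taking service_time_ms."""
    return queue_length * service_time_ms

# A queue of 4 at a component with 8ms service time
print(estimated_response_ms(4, 8))  # 32 ms
```

Halving the queue (by spreading load or reducing IOPS) halves the estimated response time under this simple model, which is the trade-off described above.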

If the queue at the disk is driving the host I/O response time up consistently, more disks in that LUN may help distribute the load, lower the queue, and therefore improve response time. If we are consistently bursty with a higher AvByQ at the LUN level, you may need to look at the host configuration to determine why it is exhibiting bursty behaviour (host buffers, maybe).

Thus, interpretation of Queue and AvByQ will depend on which component you are looking at, the I/O distribution, cache utilization, and correlation to host-reported metrics as necessary.
