SPA and Average Queue Depth Range

Question

Can anyone help explain how to read these metrics when trying to figure out whether the FA is queuing up? I know this is one of the places to look, but I don't understand what the ranges mean or whether the reading in the graph is a problem. Thanks!

PedalHarder · Answer

When an I/O arrives at the FA port, the FA updates counters that record the current FA queue depth. These counters are processed to produce the average queue depth range numbers.

So for each interval you see in SPA, the average queue depth range is the number of times an IO arrived at the FA port where the queue depth on the FA was within a particular range.

You should expect to see some queuing on a busy FA. If you are getting large or constant numbers in the higher bucket range (320 and over), then it is likely that IOs are being delayed because of the FA queue depth, and some further investigation and remedial action MAY be necessary to improve performance.
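As a rough illustration of reading the buckets, the sketch below takes the ten per-range counters for one interval and reports what share of arriving IOs found the queue in the upper ranges. The function name, the sample counts, and the choice of which range index counts as "high" are all assumptions made up for this example; check which range covers 320 and over on your own array.

```python
def high_queue_share(range_counts, first_high_range):
    """Fraction of IO arrivals that landed in the high queue depth ranges.

    range_counts: counters for ranges 0-9 for one interval (sample data only).
    first_high_range: lowest range index you treat as "high" (an assumption).
    """
    total = sum(range_counts)
    if total == 0:
        return 0.0  # idle interval, nothing arrived
    return sum(range_counts[first_high_range:]) / total


# Made-up counters for one interval, ranges 0 through 9.
sample = [5000, 2500, 1200, 600, 300, 200, 100, 60, 30, 10]
share = high_queue_share(sample, first_high_range=8)
print(f"{share:.1%} of IOs arrived with the queue in ranges 8-9")
```

A small share that appears only now and then is normal on a busy FA; a large or persistent share is the pattern worth investigating.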

Typical causes of FA queuing are :

- The FA is too busy: the total IO workload has become excessive.

- Very large IOs are intermixed with small IOs, such as when a backup runs during the day while the online workload is up.

- Hosts are sending bursts of IOs, e.g. SQL Server is good at doing this with its checkpoint writes (AKA lazy writes), where it can burst thousands of writes at a time.

brekus · Answer

Jason,

Thanks for the help. This makes it a little clearer. It looks like the ranges in SPA go from 0-9, but the graphs show a different range. I guess the ranges in Performance Manager match the ranges in SPA but they just didn't title the metrics correctly.

brekus · Answer

Now, if I find an FA that has some high queuing, what would be the next steps to find out what is causing it?

Quincy561 · Answer

Usually queuing on the FA is because the IO rate is more than the FA can deal with.  Sometimes averages over minutes don't show the real story, as IOs can arrive in bursts.
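To show why a per-minute average can hide a burst, here is a small sketch comparing two made-up one-minute traces of per-second IO counts. Both average 1000 IOPS, but one of them packs most of its IO into five seconds. The numbers and the function name are purely illustrative, not taken from any real collection.

```python
def peak_to_mean(per_second_ios):
    """Ratio of the busiest second to the average second in a trace."""
    mean = sum(per_second_ios) / len(per_second_ios)
    return max(per_second_ios) / mean if mean else 0.0


# Two hypothetical 60-second traces with the same one-minute average.
steady = [1000] * 60
bursty = [200] * 55 + [9800] * 5

for name, trace in (("steady", steady), ("bursty", bursty)):
    avg = sum(trace) / len(trace)
    print(f"{name}: average {avg:.0f} IOPS, peak/mean {peak_to_mean(trace):.1f}")
```

A one-minute sample reports the same rate for both traces, yet only the bursty one would be expected to drive the FA queue into the high ranges.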

PedalHarder · Answer

As previously stated, it's most likely workload related.

Take a Performance Manager (PM) collection at small intervals. If you are running EMC ControlCenter, you can do a 2-minute interval collection via the revolving DCP, for instance.

Are your customers complaining about performance issues? If nobody thinks there is a performance problem, is high queue depth from time to time really a problem?

If you need to address the queuing issue because device response time is being hurt:

- Start by looking at your IO rates and throughput on the FA ports. Is PM reporting high %util or %busy? If so, perhaps the FA ports are overutilised.

- If the IO rate / throughput is not too high, the issue could be bursty IO activity. To find this, identify the symdevs active on the FA port you are interested in and look at the read and write rates for each device. You are looking for a correlation between IO activity on a device or set of devices and the queue depth.
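The correlation step can be sketched as below, assuming you have exported per-interval IO rates for each device and a per-interval queue depth figure for the FA into plain lists. The device names, the sample numbers, and the helper names are hypothetical; a plain Pearson correlation is just one simple way to rank candidates, not something PM does for you.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series cannot correlate with anything
    return cov / math.sqrt(vx * vy)


def rank_devices(device_rates, queue_depth):
    """Sort devices by how closely their IO rate tracks the FA queue depth."""
    scored = [(pearson(rates, queue_depth), dev) for dev, rates in device_rates.items()]
    return sorted(scored, reverse=True)


# Hypothetical samples, one value per collection interval.
queue_depth = [2, 3, 40, 4, 2, 55]
device_rates = {
    "0A1F": [100, 120, 2500, 130, 110, 3100],  # bursts line up with the queuing
    "0B22": [400, 410, 395, 405, 400, 398],    # steady, unrelated to the queuing
}
ranked = rank_devices(device_rates, queue_depth)
for score, dev in ranked:
    print(f"{dev}: correlation {score:+.2f}")
```

Devices near the top of the list are the ones to examine at the host level; a high correlation is a hint, not proof, so confirm against the actual host IO pattern.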

When you have identified the devices causing the issue, you can look at the actual IO spikes at the host level with iostat or perfmon at a fine interval, such as 5 seconds. Now you can see the actual spikes in IO that are causing queuing on your FA ports.

Now, are these hosts using PowerPath? Are you getting even IO rates across the FA ports?

For these hosts, you may need more FA ports assigned to them to distribute the load across more FA ports.
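One quick way to check whether the load is even is to compare each port's IO rate with the share it would carry if the load were perfectly balanced. The sketch below does that for a set of made-up per-port IO rates; the port labels, the numbers, and the function name are assumptions for illustration only.

```python
def port_imbalance(port_io_rates):
    """Map each port to its load relative to a perfectly even split.

    1.0 means the port carries exactly its even share; 3.0 means three
    times its even share.
    """
    total = sum(port_io_rates.values())
    if total == 0:
        return {}
    even_share = 1 / len(port_io_rates)
    return {port: (rate / total) / even_share for port, rate in port_io_rates.items()}


# Hypothetical IO rates for the four FA ports zoned to one host.
rates = {"FA-7A:0": 9000, "FA-8A:0": 1000, "FA-9A:0": 1000, "FA-10A:0": 1000}
imbalance = port_imbalance(rates)
for port, ratio in imbalance.items():
    print(f"{port}: {ratio:.2f}x of an even share")
```

If one port sits far above 1.0 while the others are well below, look at the multipathing policy and zoning before adding ports; if all ports are near 1.0 and still queuing, more FA ports are the more likely fix.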
