May 17th, 2010 01:00

Hi,

The max queue depth on Symmetrix DMX is dynamically determined at power cycle or when configuration changes are made, ensuring the best spread of "depth". Queue depth on Symmetrix ports is very rarely an issue, despite servicing many HBAs per port simultaneously.

Normally queue depth issues only occur on the host OS\HBA side, but again that is quite rare. Ideally we do want some queuing to ensure maximum I\O effectiveness and throughput, but this can cost response time depending on how many I\O threads are running on the HBAs.

The question was asked about queue depth and the number of spindles: queue depth is not determined by the number of spindles (that is unknown to the HBA, the SAN and the Symmetrix FA).

The relationship between spindles and queue depth is best explained by an example (comments very welcome!)

Symmetrix Metavolumes and Host-Side Constraints on Storage Performance

Each Symmetrix volume (single volume or metavolume) is seen by the host server as one "physical disk". The host O\S, device driver and HBA usually allocate a fixed set of resources for each volume, regardless of the size of the volume. This means that a large metavolume (16 members or more) may not have enough host resources or I\O bandwidth allocated by the host to satisfy the performance requirements. This is not a concern for single-threaded applications, applications with low I\O, and applications with a high cache hit rate, but other environments may see performance scaling non-linearly with increasing I\O due to these issues.

The main issue described above is with host queues and "queue depth". Each volume recognised by a host gets an I\O queue with its own queue depth. Consider a 90GB dataset, presented either as 10 x 9GB volumes or as 1 x 10-member meta, with a queue depth of 8. The maximum outstanding I\Os to the dataset as seen from the host is then:

Max Outstanding I\Os (non-meta) = 10 x 8 = 80

Max Outstanding I\Os (metavol)  = 1 x 8 = 8

This can have 3 effects. Firstly, a non-meta configuration can drive I\Os down 10 queues in parallel, whereas the meta has only 1 queue (without PowerPath). Secondly, the maximum outstanding I\Os is relatively limited in meta environments, so the Symm cannot group I\Os for destage as efficiently. Thirdly, each I\O waiting in the queue sees the response time of all the preceding I\Os in the queue plus its own, so with metas there is a larger chance of higher response times. The queue depth is generally set at the O\S, device driver and HBA levels, and is often a fixed value. Too high a queue depth can stall applications, so it must be considered carefully by an expert.
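A minimal Python sketch of the arithmetic above; the service time used is an illustrative assumption, not a measured value:

```python
# Each host-visible volume gets its own I/O queue of a fixed depth, so the
# maximum outstanding I/Os the host can drive is volumes x queue depth.

def max_outstanding_ios(host_volumes: int, queue_depth: int) -> int:
    """Upper bound on I/Os in flight across all per-volume queues."""
    return host_volumes * queue_depth

def queued_response_time(position: int, service_time_ms: float) -> float:
    """An I/O at `position` (0-based) in a queue waits for every
    preceding I/O plus its own service time."""
    return (position + 1) * service_time_ms

QDEPTH = 8
print(max_outstanding_ios(10, QDEPTH))  # non-meta: 10 volumes -> 80
print(max_outstanding_ios(1, QDEPTH))   # metavolume: 1 volume -> 8

# The last I/O of a full depth-8 queue on the meta, at a notional 5 ms each:
print(queued_response_time(7, 5.0))     # 40.0 ms
```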

To a technical person this sounds like metavolumes are a bad move, but this really only affects applications doing a lot of I\O to a large metavolume with a low cache hit rate; most applications on Symmetrix do not hit this bottleneck. The convenience and striping of metavolumes are big advantages and should also be considered. It is also possible to avoid the bottleneck with metavolumes by increasing the queue depth where possible and appropriate, and by using PowerPath, which automatically gives each path its own queue.

The Symmetrix internal queue depth is much higher than host levels and is not a performance bottleneck. In fact the Symmetrix performance itself is the same for meta- or non-metavolumes, since the metavolume is a channel construct and not a "back-end" one. The performance varies only according to the host-side factors.

Metavolumes were created initially as an assistance to Microsoft for Windows, to allow Windows to work with large-scale storage despite its limit on drive letters. Metavolumes provide the server administrator with a convenient and dynamically-expandable storage environment, which is automatically striped and can be used without concern according to EMC recommendations. For this reason we generally recommend a maximum of 16 metavolume members; larger metavolume requests should be examined by an EMC performance guru if maximum performance is required.

Alasdair Eadie, Principal Technical Consultant, EMC EMEA Performance

16 Posts

March 9th, 2009 08:00

Hi,

So you are using HP-UX and, no doubt, HP-branded HBAs. I don't know anything about them specifically, but the following generalities may help.

There are two modes for qdepth: as a pool or per LUN. There is a limit to the buffer pool on the HBA, which may well be less than 128. That means that with a 128-entry pool and 10 LUNs on the server, each LUN could have an average of 12.8 I\Os in the queue.

If the mode is set to "per LUN" and one LUN has a lot more activity than another, one queue will often be full and the other often empty. The total resource is therefore badly used.

If the mode is set to "pool", then the total resource (qdepth * #LUNs) is shared amongst all the LUNs so the busy one can get what it wants by using the resources not used by the unbusy one.

However, if all the LUNs are hungry, there will still be problems. Increasing the qdepth and choosing the setting can make a big difference, but it depends on the behaviour of the application and its LUNs. Remember always that the (qdepth * #LUNs) should be less than the total resource possible on the given HBA.
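A small Python sketch of the contrast described above; the numbers (depth 12 per LUN, 10 LUNs, one hot LUN) are illustrative assumptions, not HP-UX defaults:

```python
# "per LUN" vs "pool" qdepth modes with one hot LUN and nine quiet ones.

PER_LUN_DEPTH = 12
NUM_LUNS = 10
POOL = PER_LUN_DEPTH * NUM_LUNS  # 120 total queue slots either way

# Offered load: the busy LUN wants 50 I/Os in flight, the others want 2 each.
demand = [50] + [2] * (NUM_LUNS - 1)

# "per LUN": every LUN is capped at its own fixed depth, so the busy LUN
# is throttled at 12 while most of the other queues sit nearly empty.
per_lun_served = sum(min(d, PER_LUN_DEPTH) for d in demand)

# "pool": the busy LUN can borrow the slots the quiet LUNs leave unused,
# up to the shared pool limit.
pool_served = min(sum(demand), POOL)

print(per_lun_served)  # 30 in flight (12 + 9 x 2)
print(pool_served)     # 68 in flight (50 + 9 x 2)
```

The same sketch shows the closing rule: keep (qdepth * #LUNs) at or below the total resource on the HBA, otherwise the pool is oversubscribed before any LUN gets busy.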

Try changing the settings and see what happens.

Hope this helps,
Alastair

9 Posts

March 12th, 2009 15:00

You might want to check your SAN too.
qdepth will generally affect the buffer-to-buffer credits.

If it gets too high, you can start having discards.

2 Intern • 1.3K Posts

March 15th, 2009 09:00

FA settings while we had problems:

[/root] symmask hba list

Identifier Type Adapter Physical Device Path Dir:P
---------------- ----- ---------------- ------------------------ -----
50060b00000b3510 Fibre 0-0-8-1-0 /dev/rdsk/c96t0d0 04D:0
/dev/rdsk/c108t0d0 04A:0
/dev/rdsk/c110t0d0 04B:0
50060b00000b32ae Fibre 0-0-10-1-0 /dev/rdsk/c98t0d0 13D:0
/dev/rdsk/c112t0d0 13A:0
/dev/rdsk/c114t0d0 13B:0

Later added two more cards.

50060b00000af328 Fibre 1-0-10-1-0 /dev/rdsk/c120t0d0 04A:0
/dev/rdsk/c122t0d0 04B:0
50060b00000b3c4c Fibre 1-0-8-1-0 /dev/rdsk/c116t0d0 13A:0
/dev/rdsk/c118t0d0 13B:0

But still the FAs (13 and 4) remain the same. Can this indicate there are no contentions at the FA/controller level?

Could someone explain how many outstanding I\Os an FA port can handle? What are the limits here?

2 Intern • 1.3K Posts

March 15th, 2009 09:00

How can I identify the "buffer pool limit" at the HBA level? I thought it was 256, not 128?
"Increasing the qdepth and choosing the setting": what is the setting you are talking about here?

March 16th, 2009 03:00

Hi,

The quick answer is that increasing the queue depth on each HBA can improve overall throughput compared to setting qdepth to 8, though we have seldom seen circumstances where it makes a large difference. You could certainly increase to 32 per HBA without issues and it may improve overall performance, especially if you are using metavolumes with more than 16 members. Be careful not to go above this value in general usage, especially if the server(s) are clustered: too many outstanding I\Os during a cluster failover sometimes causes the failover itself to fail.

Since this is a forum Q&A and not a performance analysis, it's a matter of trying and seeing: there's no way to tell the difference between using more HBAs (more parallel queues) and fewer HBAs (fewer but deeper queues) except by trying.

EMC offers performance healthcheck services to customers, which cost money but can examine the performance situation from "top-to-bottom" and identify how\what\where things can be improved.

Greetings,

Alasdair, Lead RTS, EMC Europe

2 Intern • 1.3K Posts

March 16th, 2009 12:00

Can you clarify "queue depth on each HBA"? AFAIK this is done at the kernel level (scsi_max_qdepth) or at the individual LUN level (scsictl).

4 Operator • 5.7K Posts

June 17th, 2009 01:00

Does anyone know the max queue depth for a CLARiiON port? I think it's 2048, but I'm not sure.
And for a DMX port?

Is the max queue depth in any way related to the number of spindles you're accessing through a specific port?

2 Intern • 1.3K Posts

September 22nd, 2009 05:00

We ended up adding two more HBAs and two more new FAs, and everything has been stable for a long time.

15 Posts

May 17th, 2010 07:00

Keep in mind that on most hosts the queue depth is actually a per-path queue depth rather than a queue on the pseudo device. So you can get additional host-side queues by adding FA ports, rather than just adding HBAs.
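A sketch of that point, assuming a hypothetical simple zoning where every HBA sees every FA port (numbers are illustrative):

```python
# If queue depth is per path, total host-side queue slots scale with the
# number of paths, and paths = HBAs x FA ports under this simple zoning.

def total_queue_slots(hbas: int, fa_ports: int, per_path_depth: int) -> int:
    """Total host-side queue slots across all paths."""
    paths = hbas * fa_ports
    return paths * per_path_depth

print(total_queue_slots(2, 2, 8))  # 32 slots
print(total_queue_slots(2, 4, 8))  # doubling FA ports -> 64 slots
print(total_queue_slots(4, 2, 8))  # doubling HBAs gives the same 64
```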
