6 Operator

 • 

5.7K Posts

4824

January 25th, 2013 01:00

QFULL and Execution Throttle / queue depth

Hello everybody,

I'm in a long and painful discussion with a colleague of mine about preventing QFULLs from the storage ports. For those who don't know what QFULLs are: these are little messages sent from a storage port when the outstanding IOs hit the port's ceiling of 1600 (on VNX). If a new IO reaches the port after the port has hit its limit, the port sends a QFULL message back to the initiator, and the OS that owns that HBA has to deal with it. In the old days OSs might have responded with these nice BSODs, but nowadays IOs are paused and, after a little while, slowly start flowing to the storage ports again.

I noticed that a few years ago the (QLogic) "execution throttle" had a max value of 256, but nowadays I'm seeing values as high as 64k (!!!). Bear in mind that a single storage port on a Clariion or VNX can only deal with 1600 outstanding IOs. So if a server sends out too many IOs that end up in the "outstanding IO queue" on the CX/VNX, these nasty little QFULLs start flowing in again. I've seen VNX 5700s getting hundreds of them every few minutes or so (in Analyzer real time view), so I can imagine that the customer will notice delays all the time.

The way to avoid triggering this annoying QFULL protection mechanism is to set the HBA's "execution throttle" to a more sensible level like 32 or 16, depending on the number of servers attached to each port and the number of LUNs each server actually uses. For VMware there are 2 knowledge base articles you might want to read: http://kb.vmware.com/kb/1267 and http://kb.vmware.com/kb/1268.
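As a rough sizing aid, the arithmetic behind picking such a value can be sketched as follows. This is only a back-of-the-envelope model: it assumes the throttle applies per LUN per HBA and that every host fills every LUN queue at the same time (worst case). The function names and the example host/LUN counts are made up for illustration; only the 1600 port limit comes from the post.

```python
PORT_QUEUE_LIMIT = 1600  # outstanding IOs per storage port (VNX figure quoted above)


def max_safe_throttle(hosts_per_port: int, luns_per_host: int,
                      port_limit: int = PORT_QUEUE_LIMIT) -> int:
    """Largest per-LUN queue depth that cannot overrun the port in the worst case.

    Assumption: every host drives every one of its LUNs to the full
    throttle value at the same time through this one port.
    """
    if hosts_per_port < 1 or luns_per_host < 1:
        raise ValueError("hosts_per_port and luns_per_host must be >= 1")
    return port_limit // (hosts_per_port * luns_per_host)


def worst_case_outstanding(hosts_per_port: int, luns_per_host: int,
                           throttle: int) -> int:
    """Worst-case number of IOs that could be queued on the port."""
    return hosts_per_port * luns_per_host * throttle


if __name__ == "__main__":
    # Hypothetical example: 10 hosts zoned to one port, 5 LUNs each.
    print(max_safe_throttle(10, 5))            # 1600 // 50 = 32
    print(worst_case_outstanding(10, 5, 256))  # 12800, far above 1600
```

In practice hosts rarely fill all their queues at once, so the worst-case figure is a conservative upper bound rather than a prediction.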

My question here is: does anyone other than me actually use this setting? I know customers who do, but also a few who don't, and I'm looking for a piece of documentation I can use to convince my colleague to start using the setting.

1 Rookie

 • 

85 Posts

January 25th, 2013 01:00

Hi RRR,

In most cases the VNX port queue depth of 1600 is not a bottleneck. If you run ALUA across 4 paths and distribute LUNs across both SPs, you will effectively have a max queue depth for your ESX cluster(s) of 6400 (4 x 1600). Personally, I don't see this as a bottleneck in most of my large customer configs.

However, there is another QFULL issue: the LUN queue full condition. The queue depth is also limited per LUN. This means that a full queue on a single LUN can cause a QFULL to be reported to the host. My experience is that server QFULL conditions are mostly caused by LUN queue full conditions, not by port queue fulls.

LUN queue full conditions occur quite quickly. With Flare 31 and R5 4+1 pools, the LUN queue full is triggered with 88 IOs in the queue. So even with a pool of 100 drives, the max queue depth of a LUN on this pool will be 88. With Flare 32 this number increases, depending on the number of data drives in the pool. For example, with 20 drives in a pool this number rises to 224 IOs. See also emc204523. Especially with a low number of large LUNs in a VMFS config, you can hit this limit sooner than the port queue limit.

And finally, what are we doing with VMware? I always make customers aware of the queue full possibilities. But I usually don't recommend lowering the HBA queue depth settings for every environment. What should you set? In my opinion it's better to monitor QFULL conditions and take action on specific servers/environments if needed.

Adaptive queue depth throttling was introduced in ESX 3.5. In my opinion this is a better approach than lowering HBA queue depths globally. In ESXi 5 it's even possible to configure adaptive queue depth on an individual disk basis.

So to summarize my opinion:

    • Monitor for Queue Full conditions
    • Be aware of the lun queue limits
    • In ESX environments use adaptive queue depth rather than limiting all HBA instances.
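To illustrate the idea behind adaptive throttling, here is a toy model in Python. It is a sketch under assumptions, not VMware's actual algorithm: it assumes the allowed depth is halved on each QFULL and raised by one after a number of good completions. The class name, parameter names, and default values are hypothetical.

```python
class AdaptiveLunQueue:
    """Toy model of adaptive queue depth throttling for one LUN."""

    def __init__(self, max_depth: int = 32, min_depth: int = 1,
                 good_needed: int = 4):
        self.max_depth = max_depth      # configured LUN queue depth
        self.min_depth = min_depth      # never throttle below this
        self.good_needed = good_needed  # good completions before stepping up
        self.depth = max_depth
        self._good = 0

    def on_completion(self, qfull: bool) -> int:
        """Feed one IO completion status and return the new queue depth."""
        if qfull:
            # Congestion: cut the allowed depth in half and restart the count.
            self.depth = max(self.min_depth, self.depth // 2)
            self._good = 0
        else:
            self._good += 1
            if self._good >= self.good_needed and self.depth < self.max_depth:
                # Congestion subsided: restore the depth one step at a time.
                self.depth += 1
                self._good = 0
        return self.depth


if __name__ == "__main__":
    q = AdaptiveLunQueue()
    print(q.on_completion(qfull=True))   # 16
    print(q.on_completion(qfull=True))   # 8
    for _ in range(4):
        depth = q.on_completion(qfull=False)
    print(depth)                         # 9
```

The point of the design is that only the LUNs that actually report QFULL get throttled, and they recover on their own, whereas a low HBA-wide setting penalizes every LUN all the time.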

6 Operator

 • 

5.7K Posts

January 25th, 2013 11:00

Thanks for explaining this. I didn't know these things about VMware.

11 Legend

 • 

20.4K Posts

 • 

87.4K Points

January 28th, 2013 17:00

Can I see the LUN queue in Analyzer?

6 Operator

 • 

4.5K Posts

January 29th, 2013 12:00

In Analyzer you can look at the Queue Length or the Average Busy Queue Length (the latter is better, as it shows what the queue is when the object is busy). This is the queue referred to above, with a limit of (14 * disks) + 32, e.g. R5 <4+1> would be (14 * 4) + 32 = 88.

glen
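For quick checks, the formula quoted above can be written as a small Python helper. This is a minimal sketch: the function names are made up, and it simply encodes (14 * data disks) + 32 as stated in the post, without verifying it against any particular Flare release.

```python
def lun_queue_limit(data_disks: int) -> int:
    """LUN queue limit per the formula quoted above: (14 * data disks) + 32."""
    if data_disks < 1:
        raise ValueError("data_disks must be >= 1")
    return 14 * data_disks + 32


def would_trigger_qfull(outstanding_ios: int, data_disks: int) -> bool:
    """True if the outstanding IOs on a LUN reach its queue limit."""
    return outstanding_ios >= lun_queue_limit(data_disks)


if __name__ == "__main__":
    # R5 <4+1> has 4 data disks.
    print(lun_queue_limit(4))          # (14 * 4) + 32 = 88
    print(would_trigger_qfull(64, 4))  # False
    print(would_trigger_qfull(88, 4))  # True
```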

11 Legend

 • 

20.4K Posts

 • 

87.4K Points

January 29th, 2013 13:00

Thank you. I never understood what "optimal" and "nonoptimal" next to the counter mean?

6 Operator

 • 

4.5K Posts

January 29th, 2013 14:00

That's supposed to show when access to the data goes over the CMI bus, as with an ALUA trespass. It's not working yet. Optimal path means the direct path; non-optimal path means ALUA using the CMI bus.

glen

6 Operator

 • 

5.7K Posts

January 29th, 2013 23:00

Wow, it hardly happens that you don't know some specific thing

6 Operator

 • 

5.7K Posts

January 29th, 2013 23:00

Thanks. I knew the queue length for a LUN was 88, but I never knew a LUN queue FULL was an actual status as well.

11 Legend

 • 

20.4K Posts

 • 

87.4K Points

January 30th, 2013 03:00

RRR wrote:

Wow, it hardly happens that you don't know some specific thing

happens all the time

6 Operator

 • 

5.7K Posts

February 4th, 2013 13:00

That can only mean 1 thing: you’re not a bot after all

11 Posts

March 29th, 2014 02:00

Hi all! And what about the LUN queue length on MCx?

6 Operator

 • 

5.7K Posts

March 31st, 2014 05:00

Hello alkhvo: as far as I know it remains the same with MCx.
