1 Rookie • 103 Posts

March 31st, 2009 11:00

Host=Windows 2003 64bit
Array=6.26

2 Intern • 1.3K Posts

March 31st, 2009 11:00

Which OS?

2.2K Posts

March 31st, 2009 15:00

Glen,
When did the storport recommendation change? We have been deploying 950903 based on the E-Lab and support recommendations from a month or so ago. Now when I look in E-Lab, all the recommendations are for 943545.

That sucks; I am not looking forward to going back, removing the new storport, and installing the older one.

4 Operator • 4.5K Posts

March 31st, 2009 15:00

Make sure that you get the QLogic drivers and installation guide from the EMC section on QLogic's web site - the drivers are set up for CLARiiON, and the install guide has EMC's recommended settings. Also, be sure to install Microsoft hotfix 943545 as recommended on the site - this is very important for performance. Finally, check to see if you have 950903 installed - if so, you should remove it; we've seen issues with that hotfix.

glen

45 Posts

April 1st, 2009 07:00

I am not crazy about the high EMC NVRAM default value for the QLogic execution throttle. I discovered the issue on QLA2340s when I started using Invista (because it sends alerts for queue-full events).

The CLARiiON will respond with a queue-full for two different events. This information comes directly from the CLARiiON Best Practices Planning guide for FLARE 26.

"A high degree of request concurrency is usually desirable, and results in a good return on investment. However, if an array's queues are too full, it will respond with a queue-full flow control command. The CX3 front-end port drivers return a queue-full status command under two conditions:
- The total number of concurrent host requests at the port is 1,984 (internally, the port value is 2048, but 64 requests are reserved for special commands)
- The total number of requests for a given LUN at a given port is (14 * (the number of data drives in the LUN)) + 32.

The host response to a queue-full is HBA-dependent, but it typically results in a suspension of activity for more than one second. Though rare, this can have serious consequences on throughput if this happens repeatedly."


I was having serious issues with some of my high-I/O servers at the default of 255. I had some LUNs on 4+4 R1/0, which has four data drives, so the calculation is 14*4+32=88. Therefore, whenever I had more than 88 requests for a given LUN on a given port, I was getting queue-full events.

I have since manually changed the execution throttle on all of my QLogic cards down to 64 and no longer get queue-full events. I have had only good results from this change. If I understand and remember it correctly, the QLogic waits a random amount of time, up to 1 second, before retrying I/Os after getting a queue-full. Can anyone confirm or deny this?

I have notified a few people in EMC support that the default EMC NVRAM setting of 255 is too high for most environments and should be set lower, but I have not seen a change. You would need 16 data drives in a RAID group to NOT hit the queue-full message if you were using one CLARiiON port. Unfortunately, the max number of spindles in a RG is 16, and at least one of those must be a parity or mirror drive, so it's impossible to have 16 data spindles in a RG. This means your LUN would need to be a metaLUN across at least two RAID groups to avoid this threshold. I am not sure why EMC changed the value to 255 from QLogic's default of 16 (as found in the QLA2340 help document).

Please use the above calculation to determine what your execution throttle should be set to, and make it a little bit lower. Or just set it to 64 and you'll be good to go in most cases (e.g., 4+1, 4+4, and anything larger). If you're using a bunch of 1+1s, you'll need to go lower.
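To make the arithmetic concrete, here's a quick back-of-the-envelope script - a minimal sketch that just restates the formulas quoted above (the names are mine, not from any EMC tool):

```python
# Sketch of the FLARE 26 queue-full limits quoted above; the names here
# are mine, not from any EMC tool.

# Condition 1: port-wide ceiling (2048 internal minus 64 reserved).
PORT_LIMIT = 2048 - 64
print(PORT_LIMIT)           # 1984 concurrent requests per front-end port

# Condition 2: per-LUN, per-port limit.
def lun_queue_limit(data_drives):
    return 14 * data_drives + 32

print(lun_queue_limit(4))   # 88  -> a 4+4 R1/0 LUN hits queue-full past 88
print(lun_queue_limit(16))  # 256 -> only 16 data drives would tolerate
                            #        the NVRAM default throttle of 255
```

Take the smallest limit among the LUNs your host touches and set the throttle a bit below that.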

I found that there is a setting on the QLogic called "Enable Extended Error Logging" which is supposed to log queue-full events in the event log. There is also a counter in NaviAnalyzer on the SP port that is supposed to show queue-full events, but I have had issues with it not working properly on some versions of FLARE (sorry, I can't recall which ones it did and didn't work on).

I hope this information can help you and others out there who are having poor disk performance due to queue full events.

4 Operator • 4.5K Posts

April 1st, 2009 07:00

An erratum was posted in the Jan. ESM to cover the change until the Feb. ESM was published.

glen

1 Rookie • 103 Posts

April 1st, 2009 09:00

Thanks, Glen, I will look into this.

1 Rookie • 103 Posts

April 1st, 2009 11:00

shewitt,

Great info!

I'm going to see if changing the execution throttle makes a difference. Where is "Enable Extended Error Logging" located? Can it be adjusted in SANsurfer?

Here's another question for you all... based on the (14 * (number of data drives in a LUN)) + 32 rule, would I still use it if my server is configured as follows?

DB lun=2x4+1
indexes/misc db=6+1
tran logs=3+3
tempdb/temp logs=2+2
backup drive=6+1

If I added this correctly, it's about 566? Isn't the max execution throttle 256?

45 Posts

April 1st, 2009 18:00

The way I understand it, you need to set the execution throttle to work for the smallest LUN (smallest as in the fewest data disks in the RG). In your example, your smallest is the 2+2, so 14*2+32=60 (see the sketch below).
As far as I can tell, on the QLA2340 cards I have, this is a global setting for the entire HBA, so it needs to be set low enough to handle the "smallest" LUN. The Emulex cards I use (LPe11000) specifically say the setting is per LUN - "Outstanding Requests on a per Lun or Target Basis (see QueueTarget)". In either case, it still needs to be sized for the "smallest" LUN.
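To put numbers on your whole list (a quick sketch - the data-drive counts are just my reading of your configs):

```python
# Apply (14 * data drives) + 32 to each LUN from the list above, then
# size the global HBA throttle for the smallest result, not the sum.
luns = {
    "DB (4+1)":        4,
    "indexes (6+1)":   6,
    "tran logs (3+3)": 3,
    "tempdb (2+2)":    2,
    "backup (6+1)":    6,
}

limits = {name: 14 * drives + 32 for name, drives in luns.items()}
for name, limit in limits.items():
    print(name, limit)

print(min(limits.values()))  # 60, from the 2+2 - the value to stay under
```

That's also why adding the limits up to ~566 doesn't apply - the global throttle has to fit the most restrictive LUN.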

If I were in your situation, I would change the value to 60 on the QLogic card and see if you notice any performance changes.
Also, check NaviAnalyzer for queue-full errors on the SP ports. As mentioned before, I did not see them in Analyzer even though I was getting them.
The "Enable Extended Error Logging" option on the QLA2340 is a checkbox on the "Settings" tab under "Advanced HBA Port Settings" on the left side.

Is there anyone else out there who has seen results similar to mine? Can anyone confirm my understanding of this setting?
It made a huge, positive impact in my environment, so I'm hoping it helps you as well.

75 Posts

May 5th, 2011 13:00

Hi,

I have a CX4-480 running FLARE 30 patch 509 with 35 ESX vSphere 4.1 hosts attached.

I have ALUA and Round Robin set; the hosts can see:

1) 3x datastores made of 3x extents as follows: 15x 300 GB 15K HDDs (three 4+1 RAID 5 groups), divided into three and striped across the groups with a metaLUN (stripe multiplier 4)

2) 4x datastores of 6x 146 GB 15K drives in RAID 10

3) 2x datastores made of 2x RAID 10 groups of 4x 450 GB 15K drives, striped with a metaLUN

With all of these disks, I see a lot of queuing on my SPs when I do operations in my VMware environment (Storage vMotion, VM cloning, etc.).

I have the Execution Throttle set to 256 on all of my hosts... could it be too high?

75 Posts

May 5th, 2011 15:00

Thanks for your answer.

Could this setting be too low for ESX?

I saw that Primus, but could this change create some kind of bottleneck on my ESX hosts' HBAs?

4 Operator • 4.5K Posts

May 5th, 2011 15:00

See Support Solution emc204523, "What is the cause of high queuing on CLARiiON drives?", on PowerLink.

The Execution Throttle for ESX should be 32 (the default); 256 is way too high.
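For reference, plugging the layouts described above into the (14 * data drives) + 32 formula from earlier in the thread (a rough sketch - the data-drive counts are my reading of those configs, so treat them as assumptions):

```python
# Rough check of the earlier queue-full formula against the layouts above;
# the data-drive counts are assumptions based on my reading of the configs.
def lun_queue_limit(data_drives):
    return 14 * data_drives + 32

print(lun_queue_limit(12))  # 200 - metaLUN across three 4+1 R5 groups
print(lun_queue_limit(3))   # 74  - 6-drive RAID 10 (3 data drives)
print(lun_queue_limit(4))   # 88  - metaLUN of two 4-drive RAID 10 groups
```

Even the largest limit here is 200, so a throttle of 256 can overrun every one of these LUNs, while 32 stays safely below all of them.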

glen

4 Operator • 4.5K Posts

May 9th, 2011 14:00

If the Execution Throttle for QLogic HBAs on ESX is set to 256, that is too high - set it to 32, then test performance.

glen
