StephN_AU
1 Nickel

LUN with high response times ? How to fix?


Hi All

I'm having a performance issue with one of my applications and I think I've traced the cause. I just don't know how to fix it, so if anyone has any ideas I would really appreciate it.

Scenario: large SQL Server, virtualised on ESX 5.0, running on a VNX 5500 Unified with a large 3-tier storage pool, FAST VP & FAST Cache (1000GB), FLARE 32 patch 015.

One of the LUNs attached to the server shows high response times, at times >40ms. We also get high device and kernel latency at the ESX level. During those periods the write IO size is high (128KB+). Read size is variable, but large reads with tiny writes don't have the same impact on latency.

From what I've read, writes of that size are going to bypass FAST Cache. Is this correct?

What can I do to improve performance with the large write size coming through?

Oh yes.. critical high profile project being affected - as always!

Thanks

14 Replies
dynamox
6 Gallium

Re: LUN with high response times ? How to fix?

FAST Cache tracks data temperature at a granularity of 64KB, so it should not be trying to promote anything. Is this server using VMDK/VMFS or RDM? Did you look at the drives that make up the pool? Are they busy (in terms of IOPS)?

StephN_AU
1 Nickel

Re: LUN with high response times ? How to fix?

Hi dynamox, the drive is a VMDK on a VMFS5 volume. It is the only VMDK on the LUN, ~1.25TB in size, and is presented only to that guest. We use the PVSCSI virtual adapters for our SQL Server guests as well.

The pool is very large: 25 x 200GB EFD, 70 x 600GB 15K SAS, and 70 x 2TB NL-SAS. All are in RAID5 (4+1), as it was first built 2.5 years ago on FLARE 31 (and before best practice said to limit the maximum number of disks in a pool).
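For anyone following along, here is a rough sketch of what that layout works out to in private RAID groups and usable capacity. The drive counts, sizes and RAID5 (4+1) come from the post; the function names are made up for illustration, and the capacity figure ignores formatting overhead, hot spares and pool metadata, so treat it as an upper bound rather than what Unisphere would report.

```python
# Drive counts and sizes are from the post; names and structure are illustrative.
TIERS = {
    "EFD": (25, 200),       # (drive count, drive size in GB)
    "SAS 15K": (70, 600),
    "NL-SAS": (70, 2000),
}
DATA_DISKS, PARITY_DISKS = 4, 1  # RAID5 (4+1)


def tier_layout(drive_count, drive_gb):
    """Return (raid_groups, leftover_drives, usable_gb) for one tier."""
    group_size = DATA_DISKS + PARITY_DISKS
    groups, leftover = divmod(drive_count, group_size)
    # Only the data disks in each group contribute usable space.
    usable_gb = groups * DATA_DISKS * drive_gb
    return groups, leftover, usable_gb


def pool_summary(tiers=TIERS):
    """Apply tier_layout to every tier in the pool."""
    return {name: tier_layout(*spec) for name, spec in tiers.items()}


if __name__ == "__main__":
    for name, (groups, leftover, usable_gb) in pool_summary().items():
        print(f"{name}: {groups} RAID groups, {leftover} leftover drives, "
              f"~{usable_gb} GB before overhead")
```

The point of counting groups per tier is that a single busy group is then easy to put in proportion: one group out of 14 in a tier is a small slice of the spindles.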

I haven't braved Analyzer and 165 disks in a single chart so far. I assumed that the load would have been evenly spread.

LUN tiering profile is 2.8% / 88.6% / 8.52% with a policy of auto-tiering.

Cheers

StephN_AU
1 Nickel

Re: LUN with high response times ? How to fix?

Disk IO stats for the pool - very rough:

EFD tier: IOPS peak at about 512/s; max utilisation 31%, avg ~16%.

SAS tier: IOPS avg about 90/s, overall peaks up to 200/s. I have one disk peaking at 343/s. Utilisation averages about 30% and peaks at about 62%. (PS: for the entire 8-hour window I'm analysing, utilisation for all SAS disks never once drops below ~15%.)

NL-SAS tier: I can see one hot RAID group. It's peaking at about 70% utilisation and the rest are at 14%. IOPS for the hot RG are up to 200/s and for the rest 50/s or lower.

So if I have one hot RG in a pool, how do I control data placement to spread the load? For a traditional LUN I would move that to a more appropriate RG or disk type, but with a pool I don't have that control.
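As an aside, if eyeballing that many disks in one chart is the obstacle, a small script over exported per-RG peak figures can pick out the outlier. This is only a sketch: the RG names are hypothetical, the "2x the median" threshold is an arbitrary assumption, and the sample numbers just mirror the rough 70% vs 14% figures above rather than real exported data.

```python
from statistics import median


def find_hot_groups(peak_util, factor=2.0):
    """Return the RAID groups whose peak utilisation exceeds factor x the median.

    peak_util maps a RAID group name to its peak utilisation in percent.
    The median is used as the baseline so one outlier does not skew it.
    """
    baseline = median(peak_util.values())
    return {rg: util for rg, util in peak_util.items() if util > factor * baseline}


if __name__ == "__main__":
    # Hypothetical names; values mirror the rough figures quoted in this post.
    nl_sas = {f"RG{i}": 14 for i in range(1, 14)}
    nl_sas["RG14"] = 70
    print(find_hot_groups(nl_sas))
```

The same function works on peak IOPS instead of utilisation if that is the metric you trust more.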

Thanks

dynamox
6 Gallium

Re: LUN with high response times ? How to fix?

I don't think your pool is that big; mine has a lot more drives. I personally concentrate on IOPS and response time, not as much on utilization. As long as you have at least one I/O, the disk will report as being utilized. SAS tier: are those 10K or 15K drives? Maybe they are being worked a little too hard and it's time to add more drives to the pool?

StephN_AU
1 Nickel

Re: LUN with high response times ? How to fix?

SAS are 15K rpm, and yes they are very busy - more than I expected.
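One possible reason the disks look busier than the host numbers suggest is the RAID5 write penalty: a small random write costs four back-end I/Os (read data, read parity, write data, write parity). The sketch below is only back-of-envelope arithmetic. The function names and the sample host figures are invented for illustration, and it ignores FAST Cache, SP write cache coalescing and full-stripe writes, all of which lower the real back-end load.

```python
def backend_iops(host_iops, write_fraction, write_penalty=4):
    """Estimate back-end disk IOPS from host IOPS.

    write_penalty=4 is the usual small-random-write figure for RAID5;
    large sequential (full-stripe) writes cost considerably less.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    reads = host_iops * (1.0 - write_fraction)
    writes = host_iops * write_fraction
    return reads + writes * write_penalty


def per_disk_iops(host_iops, write_fraction, disks, write_penalty=4):
    """Spread the estimated back-end load evenly over a number of disks."""
    return backend_iops(host_iops, write_fraction, write_penalty) / disks


if __name__ == "__main__":
    # Hypothetical host load, NOT measured on this array.
    print(per_disk_iops(host_iops=5000, write_fraction=0.3, disks=70))
```

Comparing the per-disk estimate with what Analyzer shows for the tier gives a feel for how much of the load is write amplification rather than raw host demand.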

We are reaching the physical limit of disks for the VNX. I could maybe get another 2 RAID groups in there, but we are due for a refresh next year so there may be some reluctance to spend $$s in the short term.

Would changing the auto-tiering policy to "highest first, then auto" make a difference?

Steph

Roger_Wu
4 Ruthenium

Re: LUN with high response times ? How to fix?

Can you share the NAR file? Maybe I can generate a performance analysis report for you. You can also submit a case to EMC support if it's a real case.

dynamox
6 Gallium

Re: LUN with high response times ? How to fix?

I would try highest and leave it there. Are you seeing high response times all the time or only during specific time periods?

Joshua_Fawcett
1 Nickel

Re: LUN with high response times ? How to fix?

Using the "high then auto" policy only means that new slices will be obtained from the highest tier possible and then will be auto-tiered to the correct tier (as deemed appropriate by FAST on the array) at the first opportunity. If this LUN is not constantly getting new slices that may not help much. It probably wouldn't hurt either but it most likely wouldn't help much more than just having it set to "auto-auto".

Have you noticed any other signs of bad performance - forced flushing, queue full counts on the SP ports, high SP utilization, etc.?

StephN_AU
1 Nickel

Re: LUN with high response times ? How to fix?

In response to questions:

I pulled yesterday's NAR file and see high response times all day on this LUN (min 4ms, max 50ms, avg about 20ms). No specific correlation to any other activity, e.g. IO, bandwidth, ABQL. They all fluctuate, but not in line with the response times.
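In case it helps anyone doing the same check, the "does it move with anything else" question can be put into numbers by exporting the LUN metrics per polling interval and computing a Pearson correlation against response time. This is a minimal sketch: the metric names and sample values are hypothetical (not from this NAR file), and a low coefficient only rules out a linear relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(response_ms, metrics):
    """Return (metric name, r) pairs, strongest absolute correlation first."""
    scored = [(name, pearson(response_ms, series)) for name, series in metrics.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical samples, one value per polling interval.
    response_ms = [4, 12, 20, 35, 50, 18]
    metrics = {
        "write_size_kb": [8, 32, 64, 128, 256, 64],
        "total_iops": [900, 1100, 800, 1000, 950, 1050],
    }
    for name, r in rank_by_correlation(response_ms, metrics):
        print(f"{name}: r = {r:+.2f}")
```

Running it per metric (write size, read size, IOPS, bandwidth, ABQL) gives a ranked list instead of a visual impression from overlaid charts.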

I checked several other SQL LUNs on the same array, including one of our "beast" applications, and for the same period they average 1-2ms, with peaks at 3-4ms.

SP utilisation sits at about 50% during the day and there is no evidence of dirty pages being abnormally high. Only 2 spikes to 90% for the whole day; watermarks are 50% & 70%.

No forced flush stats on the LUN because it's in a pool (frustrating!)

@Roger - thanks for the offer. I will be opening a case with EMC. Our local EMC team also has a good performance guy, so I will chat to him and see if he has any ideas.

Thanks all
