StephN_AU

51 Posts

7734

December 3rd, 2013 15:00

LUN with high response times? How to fix?

Hi All

I'm having a performance issue with one of my applications and I think I've traced the cause. I just don't know how to fix it, so if anyone has any ideas I would really appreciate it.

Scenario: Large SQL Server, virtualised on ESX 5.0, running on a VNX 5500 Unified with a large 3-tier storage pool, FAST VP and FAST Cache (1000GB), FLARE 32 patch 015.

One of the LUNs attached to the server shows high response times, at times >40ms. We also get high device and kernel latency at the ESX level. During those periods the write size of the IO is high, 128KB+. Read size is variable, but large reads with tiny writes don't have the same impact on latency.
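One way to check the link between write size and latency is to export both series per polling interval and compute a correlation coefficient. The sketch below is only an illustration: the sample values are made up, and `pearson` is a hand-written helper, not part of any EMC or VMware tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Hypothetical samples, one per polling interval (not real data from this array)
write_kb = [8, 16, 64, 128, 256, 128, 32]
latency_ms = [3, 4, 15, 42, 55, 38, 9]

r = pearson(write_kb, latency_ms)
print(f"correlation between write size and latency: r = {r:.2f}")
```

A value of r close to 1 would support the idea that latency rises with write size. A value near 0 would suggest looking elsewhere.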

From what I've read, writes of that size are going to bypass FAST Cache. Is this correct?

What can I do to improve performance with the large write size coming through?

Oh yes.. critical high profile project being affected - as always!

Thanks

Responses(14)

StephN_AU

51 Posts

0

December 3rd, 2013 16:00

Hi dynamox, the drive is a VMDK on a VMFS5 volume. It is the only VMDK on the LUN (~1.25TB in size) and is presented only to that guest. We use the PVSCSI virtual adapters for our SQL Server guests as well.

The pool is very large: 25 x 200GB EFD, 70 x 600GB 15K SAS, and 70 x 2TB NL-SAS. All are in RAID5 (4+1), as it was first built 2.5 years ago on FLARE 31 (and before best practice said to limit the maximum number of disks in a pool).
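For a rough sense of scale, the usable capacity per tier can be estimated from the disk counts above. This is a back-of-the-envelope sketch: it assumes every disk sits in a 4+1 group and ignores formatting and pool metadata overhead, so the real figures will be somewhat lower.

```python
# Rough usable capacity per tier for RAID5 (4+1) private RAID groups.
# Assumption: no hot spares or overhead are subtracted.

DATA_DISKS = 4   # data disks per RAID5 (4+1) group
GROUP_SIZE = 5   # total disks per group (4 data + 1 parity)

# (tier name, disk count, disk size in GB), taken from the post above
tiers = [
    ("EFD", 25, 200),
    ("SAS 15K", 70, 600),
    ("NL-SAS", 70, 2000),
]

def usable_gb(disk_count, disk_gb):
    """Usable GB for a tier built entirely from 4+1 groups."""
    groups = disk_count // GROUP_SIZE
    return groups * DATA_DISKS * disk_gb

capacity = {name: usable_gb(count, size) for name, count, size in tiers}
total = sum(capacity.values())

for name, gb in capacity.items():
    print(f"{name:8s} {gb:>7d} GB  ({gb / total:.1%} of pool)")
```

That works out to roughly 4,000GB of EFD, 33,600GB of SAS and 112,000GB of NL-SAS before overhead.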

I haven't braved Analyzer with 165 disks in a single chart so far. I assumed that the load would have been evenly spread.

LUN tiering profile is 2.8% / 88.6% / 8.52%, with a tiering policy of auto-tier.

Cheers

dynamox

1 Rookie

•

20.4K Posts

0

December 3rd, 2013 16:00

FAST Cache tracks data temperature at a granularity of 64KB, so it should not be trying to promote anything. Is this server using VMDK/VMFS or RDM? Did you look at the drives that make up the pool? Are they busy (in terms of IOPS)?

dynamox

1 Rookie

•

20.4K Posts

1

December 3rd, 2013 17:00

I don't think your pool is that big; mine has a lot more drives. I personally concentrate on IOPS and response time, not so much on utilization. As long as there is at least one I/O, the disk will report as being utilized. For the SAS tier, are those 10K or 15K drives? Maybe they are being worked a little too hard and it's time to add more drives to the pool?

StephN_AU

51 Posts

0

December 3rd, 2013 17:00

Disk IO stats for the pool - very rough:

EFD tier: IOPS peak at about 512/s, max utilisation 31%, avg ~16%.

SAS tier: IOPS avg about 90/s, overall peaks up to 200/s. I have one disk peaking at 343/s. Utilisation averages about 30% and peaks at about 62%. (PS: for the entire 8-hour window I'm analysing, utilisation for all SAS disks never once drops below ~15%.)

NL-SAS tier: I can see one HOT RAID group. It's peaking at about 70% utilisation and the rest are at 14%. IOPS for the HOT RG are up to 200/s and for the rest 50/s or lower.
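With many RAID groups in one chart, a simple filter can pick out the outliers instead of eyeballing them. The sketch below flags any group whose utilisation is well above the median. The group labels and the 2x threshold are assumptions for illustration, and the figures mirror the NL-SAS numbers above (70 disks in 4+1 gives 14 groups).

```python
from statistics import median

def hot_groups(utilisation, factor=2.0):
    """Return names of RAID groups whose utilisation exceeds factor x the median."""
    baseline = median(utilisation.values())
    return [name for name, pct in utilisation.items() if pct > factor * baseline]

# Hypothetical labels rg0..rg13; values are peak utilisation in percent
nl_sas = {f"rg{i}": 14.0 for i in range(13)}
nl_sas["rg13"] = 70.0

print(hot_groups(nl_sas))
```

Using the median rather than the mean keeps the baseline from being pulled up by the hot group itself.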

So if I have one HOT RG in a pool, how do I control data placement to spread the load? For a traditional LUN I would move it to a more appropriate RG or disk type, but with a pool I don't have that control.

Thanks

dynamox

1 Rookie

•

20.4K Posts

0

December 3rd, 2013 18:00

I would try highest and leave it there. Are you seeing high response times all the time or only in specific time periods?

Roger_Wu

4K Posts

0

December 3rd, 2013 18:00

Can you share the NAR file? Maybe I can generate a performance analysis report for you. You can also submit a case to EMC support if it's a real case.

StephN_AU

51 Posts

0

December 3rd, 2013 18:00

The SAS drives are 15K rpm, and yes, they are very busy - more than I expected.

We are reaching the physical limit of disks for the VNX. I could maybe get another 2 RAID groups in there, but we are due for a refresh next year so there may be some reluctance to spend $$s in the short term.

Would changing the auto-tiering policy to "Highest first, then auto" make a difference?

Steph

Joshua_Fawcett

35 Posts

0

December 4th, 2013 08:00

Using the "high then auto" policy only means that new slices will be obtained from the highest tier possible and then will be auto-tiered to the correct tier (as deemed appropriate by FAST on the array) at the first opportunity. If this LUN is not constantly getting new slices that may not help much. It probably wouldn't hurt either but it most likely wouldn't help much more than just having it set to "auto-auto".

Have you noticed any other signs of bad performance - forced flushing, queue full counts on the SP ports, high SP utilization, etc.?

StephN_AU

51 Posts

0

December 4th, 2013 21:00

In response to questions:

I pulled yesterday's NAR file and see high response times all day on this LUN (min 4ms, max 50ms, avg about 20ms). There is no specific correlation to any other activity, e.g. IO, bandwidth, ABQL - they all fluctuate, but not in line with the response times.

I checked several other SQL LUNs on the same array, including one of our "beast" applications, and for the same period they average 1-2ms, with peaks at 3-4ms.

SP utilisation sits at about 50% during the day and there is no evidence of dirty pages being abnormally high: 2 spikes to 90% for the whole day; watermarks are 50% & 70%.

No forced flush stats on the LUN because it's in a pool (frustrating!)

@Roger - thanks for the offer. I will be opening a case with EMC. Our local EMC team also has a good performance guy, and I will chat to him and see if he has any ideas.

Thanks all

dynamox

1 Rookie

•

20.4K Posts

0

December 5th, 2013 05:00

Steph,

let us know what comes out of their assessment and the steps to remediate the problem.

StephN_AU

51 Posts

0

December 18th, 2013 15:00

Hi All

I have some feedback from EMC. Their assessment is that both the READ and WRITE IO sizes are causing the high latency on the LUN. This is the summary of the solution:

For time-sensitive application, as a workaround increasing the tier for LUN573 will improve its performance to a certain degree.

Right click LUN and select Tiering Policy->Highest Available Tier.

The final resolution is to adjust IO size to make LUN more responsive.

dynamox

1 Rookie

•

20.4K Posts

0

December 18th, 2013 17:00

Steph wrote:

The final resolution is to adjust IO size to make LUN more responsive.

Is that tunable in MS SQL?

benconrad1

105 Posts

0

December 18th, 2013 19:00

Do the LUNs in question happen to be updated by SQL Server Transactional Replication?

Transactional Replication Overview

Ben

Dinesh-sengar

7 Posts

0

September 29th, 2014 13:00

Hi Ben,

We are experiencing a similar problem, wherein a VNX LUN allocated to an ESX host is shared by 5 VMs. One of the VMs hosts a SQL database, and the data in this database is updated by SQL Server Transactional Replication. The VNX LUN is experiencing high response time (~50ms) but has a good service time (under 10ms). The SQL host is getting IO timeout errors. Your comment above suggests that some tuning needs to be done on the SQL side if SQL Server Transactional Replication is in place. Could you please guide us further on this? What exactly needs to be done? Or do you have a white paper or best practices for it? Appreciate your help. Thanks
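The gap between response time and service time can be read as time spent waiting in a queue. The sketch below uses a deliberately simple model (response = queue wait + service, with requests served one at a time), which is an assumption for illustration and not necessarily how the VNX derives its counters. The function names are hypothetical.

```python
def queue_wait_ms(response_ms, service_ms):
    """Time spent queued, assuming response = queue wait + service."""
    if service_ms <= 0 or response_ms < service_ms:
        raise ValueError("expected 0 < service_ms <= response_ms")
    return response_ms - service_ms

def requests_ahead(response_ms, service_ms):
    """Average number of requests ahead in the queue under a simple FIFO model."""
    return queue_wait_ms(response_ms, service_ms) / service_ms

# Figures from the post: ~50ms response, service time taken at its 10ms upper bound
wait = queue_wait_ms(50, 10)
ahead = requests_ahead(50, 10)
print(f"queue wait: {wait} ms, about {ahead:.0f} requests ahead on average")
```

Under these assumptions at least 40ms of each 50ms is queueing rather than disk work, which points at queue depth or contention from the other VMs sharing the LUN rather than at slow drives.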
