I'm having a performance issue with one of my applications and I think I've traced down the cause - I just don't know how to fix it, so if anyone has any ideas I would really appreciate it.
Scenario: Large SQL Server, virtualised on ESX 5.0, running on a VNX 5500 unified; large 3-tier storage pool with FAST VP & FAST Cache (1000GB), FLARE 32 patch 015.
One of the LUNs attached to the server shows high response times, at times >40ms, and we also see high device and kernel latency at the ESX level. During those periods the write size of the IO is large, 128KB+. Read size is variable, but large read sizes with tiny writes don't have the same impact on latency.
From what I've read, writes that size are going to bypass FAST Cache? Is this correct?
What can I do to improve performance with these large write sizes coming through?
Oh yes... a critical, high-profile project is being affected - as always!
FAST Cache tracks data temperature at a granularity of 64KB, so it should not be trying to promote anything. Is this server using VMDK/VMFS or RDM? Did you look at the drives that make up the pool - are they busy (in terms of IOPS)?
Hi dynamox, the drive is a VMDK on a VMFS5 volume. It is the only VMDK on the LUN (~1.25TB in size), which is presented only to that guest. We use the PVSCSI virtual adapters for our SQL Server guests as well.
The pool is very large: 25 x 200GB EFD, 70 x 600GB 15K SAS, and 70 x 2TB NL-SAS - all in RAID5 (4+1), as it was first built 2.5 years ago on FLARE 31 (and before best practice said to limit the maximum number of disks in a pool).
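For anyone following along, the usable capacity per tier is easy to sanity-check: RAID5 (4+1) keeps 4 data disks out of every 5. A quick sketch (private-RAID-group binding overhead and hot spares are ignored here):

```python
def raid5_4p1_usable(drives, size_gb):
    """Usable GB for RAID5 (4+1): 4 data + 1 parity disk per group."""
    groups = drives // 5
    return groups * 4 * size_gb

efd = raid5_4p1_usable(25, 200)      # 5 groups  -> 4000 GB
sas = raid5_4p1_usable(70, 600)      # 14 groups -> 33600 GB
nl_sas = raid5_4p1_usable(70, 2000)  # 14 groups -> 112000 GB
print(efd, sas, nl_sas)
```

So roughly 4TB of EFD and 33.6TB of 15K SAS in front of 112TB of NL-SAS, before overhead.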
I haven't braved Analyzer with 165 disks in a single chart so far... I assumed the load would have been evenly spread.
The LUN tiering profile is 2.8% / 88.6% / 8.52%, with a policy of auto-tiering.
Disk IO stats for the pool - very rough:
EFD tier: IOPS peak at about 512/s; max utilisation 31%, avg ~16%.
SAS tier: IOPS average about 90/s, with overall peaks up to 200/s. I have one disk peaking at 343/s. Utilisation averages about 30% and peaks at about 62%. (PS: for the entire 8-hour window I'm analysing, utilisation for all SAS disks never once drops below ~15%.)
NL-SAS tier: I can see one HOT RAID group; it's peaking at about 70% utilisation while the rest are at 14%. IOPS for the HOT RG are up to 200/s, and for the rest 50/s or lower.
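Once the per-disk stats are out of Analyzer, a per-tier rollup like the one above is easy to script rather than eyeballing 165 disks in one chart. A minimal sketch - the rows and column names here are invented for illustration; a real run would load them from a CSV exported from the NAR file:

```python
from statistics import mean

# Hypothetical per-disk samples (tier names and fields are assumptions)
rows = [
    {"disk": "0_0_4", "tier": "EFD",    "iops": 512, "util": 31},
    {"disk": "1_0_7", "tier": "SAS",    "iops": 343, "util": 62},
    {"disk": "1_0_8", "tier": "SAS",    "iops": 90,  "util": 30},
    {"disk": "2_1_3", "tier": "NL-SAS", "iops": 200, "util": 70},
    {"disk": "2_1_4", "tier": "NL-SAS", "iops": 50,  "util": 14},
]

def tier_summary(rows):
    """Group disk samples by tier and flag disks over 50% utilisation."""
    tiers = {}
    for r in rows:
        tiers.setdefault(r["tier"], []).append(r)
    return {
        t: {
            "avg_iops": mean(r["iops"] for r in rs),
            "max_util": max(r["util"] for r in rs),
            "hot_disks": [r["disk"] for r in rs if r["util"] > 50],
        }
        for t, rs in tiers.items()
    }
```

That makes the one hot NL-SAS RAID group (and the busy SAS disk) jump out immediately.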
So if I have one HOT RG in a pool, how do I control data placement to spread the load? For a traditional LUN I would move it to a more appropriate RG or disk type, but with a pool I don't have that control.
I don't think your pool is that big - mine has a lot more drives. I personally concentrate on IOPS and response time, not as much on utilization. As long as there is at least one I/O, a disk will report as being utilized. The SAS tier - are those 10K or 15K drives? Maybe they are being worked a little too hard and it's time to add more drives to the pool?
SAS are 15K RPM, and yes, they are very busy - more than I expected.
We are reaching the physical disk limit of the VNX. I could maybe get another 2 RAID groups in there, but we are due for a refresh next year, so there may be some reluctance to spend $$ in the short term.
Would changing the auto-tiering policy to "Highest Available Tier first", then auto, make a difference?
Can you share the NAR file? Maybe I can generate a performance analysis report for you. You could also submit a case to EMC Support if it's a real production issue.
Using the "high then auto" policy only means that new slices will be obtained from the highest tier possible and then will be auto-tiered to the correct tier (as deemed appropriate by FAST on the array) at the first opportunity. If this LUN is not constantly getting new slices that may not help much. It probably wouldn't hurt either but it most likely wouldn't help much more than just having it set to "auto-auto".
Have you noticed any other signs of bad performance - forced flushing, queue-full counts on the SP ports, high SP utilization, etc.?
In response to questions:
I pulled yesterday's NAR file and see high response times all day on this LUN (min 4ms, max 50ms, avg about 20ms). There's no specific correlation to any other activity, e.g. IOPS, bandwidth, or ABQL - they all fluctuate, but not in line with the response times.
I checked several other SQL LUNs on the same array, including one of our "beast" applications, and over the same period they average 1-2ms with peaks at 3-4ms.
SP utilisation sits at about 50% during the day, and there is no evidence of dirty pages being abnormally high - 2 spikes to 90% for the whole day; watermarks are 50% & 70%.
No forced-flush stats on the LUN because it's in a pool (frustrating!).
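If it helps, the "no correlation" observation can be put on firmer footing than eyeballing the Analyzer charts: export the samples to CSV and compute a Pearson correlation between response time and each other metric. A minimal sketch (plain-Python implementation, no library needed; the sample arrays below are made up):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. response-time samples vs IOPS samples from the same intervals;
# values near 0 support "fluctuating, but not in line with each other"
resp_ms = [4, 18, 22, 50, 19]
iops = [900, 850, 910, 880, 905]
print(pearson(resp_ms, iops))
```

A coefficient near +1 or -1 would point at that metric; near 0 means latency is being driven by something the chart isn't showing.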
@Roger - thanks for the offer. I will be opening a case with EMC; our local EMC office also has a good performance guy, and I will chat with him to see if he has any ideas.