
March 26th, 2014 22:00

[VNX] storage pool - disk showing utilisation 100% and response time high

Hi,

I'm having an issue with a RAID-5 pool (15 x 600 GB 15K SAS disks) at the moment. A single datastore is mounted on the ESXi hosts from a single LUN out of this pool.

It has been giving me huge latency, and the ESXi hosts keep losing access to the volume and then restoring it almost immediately, every minute or so.

So I decided to move some of the VMs off the datastore (it contains 35 VMs) to a less constrained datastore, which seems to alleviate the issue a little.

I'm going over the NAR file in Analyzer, and it seems that two of the 15 disks are showing 100% utilization with response times through the roof, around 700-800 ms. However, they aren't really pushing much throughput.

I'm not sure whether this is normal, as there may be some background process going on, but I don't see this issue on any other storage pool in the environment.

Can someone please share some ideas?

Thanks,

John

[Attachment: aa.png]

March 26th, 2014 23:00

Hi,

If I understand you correctly:

1. You have one storage pool with 15K SAS drives (RAID-5 4+1).

2. Only a single LUN is currently configured on that pool.

3. You are only experiencing this behaviour on this specific LUN.

I would check whether this LUN has trespassed from its owner SP. Next, is this behaviour persistent no matter which VMs are present on the LUN, or does it only occur with specific VMs?

Was this pool expanded, or was it created with 15 SAS drives right from the start? Do you have FAST enabled on that pool?


March 27th, 2014 00:00

Hi,

No, it hasn't trespassed.

Thanks,

John

March 27th, 2014 00:00

OK, have you checked whether the LUN has trespassed from its owner SP?

Regards | Mit freundlichen Grüßen

Allam

March 27th, 2014 00:00

Can you perhaps share a NAR file that captures the issue?

Regards | Mit freundlichen Grüßen

Allam


March 27th, 2014 00:00

1. Yes, that's correct: a single 6 TB LUN from this storage pool. FAST Cache is not enabled.

2. I have 35 VMs running on this LUN. The pool was built with 15 drives and has only a single tier, 15K SAS.

3. The LUN is on SP A, which is its allocation owner.

There's been a bit of shuffling on this problematic datastore since people started complaining about performance and delays.

But from what I see in Analyzer, it doesn't really push much IO during the day, roughly 500-800 IOPS. Individual disks push about 100-120 IOPS when they're really busy. Response time for each disk seems quite high, though; some disks randomly spike to 100-150 ms or more.

Thanks,

John


March 27th, 2014 00:00

Sure, thanks for your help!

1 Attachment


March 27th, 2014 04:00

The drives that belong to "SQL_DB" are hammered around 10:20 am and 10 pm. Are those backups or some kind of batch job? Which LUNs are presented to your ESXi boxes?

March 27th, 2014 05:00

Refer to the snapshot. As @dynamox mentioned, that LUN gets heavily hammered around the same time every day.

The timestamps in this plot are GMT+2. If you have 15 SAS drives and each one delivers about 160 IOPS, then at a 100% read workload you'd only get roughly 2,400 IOPS max, and according to the plot below the pool is being driven to over 4,000 IOPS, so the queue and the response time will definitely shoot up.

[Attachment: Capture.PNG.png]
If it's a backup that runs at night (outside business hours), I wouldn't worry about it. But if it's running during business hours, then that definitely explains your issue.
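For what it's worth, here is the same back-of-envelope math as a quick sketch. The ~160 IOPS per 15K drive and the RAID write-penalty figures are rules of thumb, not measured values, so treat the numbers as rough estimates only:

```python
# Rough pool IOPS estimate (assumed rules of thumb, not measured values):
# ~160 IOPS per 15K SAS drive, RAID-5 write penalty of 4, RAID-1/0 penalty of 2.

def pool_iops_capacity(drives, iops_per_drive, write_penalty, read_fraction):
    """Approximate host IOPS a pool can sustain for a given read/write mix."""
    raw = drives * iops_per_drive                       # back-end IOPS available
    # Each host write costs `write_penalty` back-end IOs, each read costs 1.
    backend_per_host_io = read_fraction + (1 - read_fraction) * write_penalty
    return raw / backend_per_host_io

# 15 x 15K SAS drives in RAID-5 (4+1):
print(pool_iops_capacity(15, 160, 4, 1.0))   # 100% read  -> ~2400 host IOPS
print(pool_iops_capacity(15, 160, 4, 0.7))   # 70/30 mix  -> ~1263 host IOPS
```

Either way, the 4,000+ IOPS peaks in the plot are well beyond what 15 spindles can absorb.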


March 27th, 2014 05:00

Hi,

Backups and virus scans kick off around 10 pm.

The NAR only covers 3 days of data, during which I had to shuffle VMs around the 6 datastores you can see in the NAR (one LUN from each storage pool). I believe 10:20 am was when I had to Storage vMotion one of the demo Oracle servers.

I know it's not ideal at the moment; I haven't really given much thought to queuing and so on. I'm waiting on another tray to fix this issue.

The funny thing is, these "lost access to volume" and "successfully restored access to volume" messages are constantly showing up for this problematic "SERVERS" datastore, which led me to ask whether the two disks showing ridiculously high response times might be the culprit.


March 27th, 2014 06:00

I see what you mean. I believe we don't really "use" FAST relocation, on the assumption that a single tier wouldn't benefit much in OE 30.

I guess I should think about scheduling FAST relocation as a test, although I'm not too sure whether enabling FAST on an already constrained LUN would cause more headaches.

Thanks again!

March 27th, 2014 06:00

Also note that FAST doesn't just move data between tiers. It also moves data within the same tier, based on the temperature of the slices in the storage pool, to evenly balance the load across all the spindles in the pool. Without FAST, data is placed on the spindles based on capacity only, not performance.

This could explain why specific spindles in the pool are being hammered more than others under a normal workload.
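To make the idea concrete, here is a very simplified sketch of the difference (this is only an illustration of capacity-based placement versus temperature-aware rebalancing, not VNX's actual FAST algorithm; the slice temperatures and group count are made up):

```python
# Simplified illustration only (not VNX's actual FAST algorithm):
# capacity-only placement can leave all the hot slices on a few spindles,
# while temperature-aware rebalancing spreads the hot slices across them.

# Hypothetical slice temperatures (IO/s per 1 GB pool slice).
slice_temps = [900, 850, 800, 20, 15, 10, 10, 5, 5, 5, 5, 5]
groups = 3                      # pretend the pool stripes over 3 private RAID groups
per_group = len(slice_temps) // groups

def group_load(assignment):
    load = [0] * groups
    for g, temp in assignment:
        load[g] += temp
    return load

# Capacity-only placement: slices are laid down in write order, filling one
# group's free capacity before spilling over to the next.
capacity_only = [(i // per_group, t) for i, t in enumerate(slice_temps)]

# Temperature-aware rebalance: hottest slices first, each one placed on the
# currently least-loaded group (a simple greedy heuristic).
rebalanced, load = [], [0] * groups
for temp in sorted(slice_temps, reverse=True):
    g = load.index(min(load))
    rebalanced.append((g, temp))
    load[g] += temp

print("capacity-only load per group:", group_load(capacity_only))   # [2570, 40, 20]
print("temperature-aware load per group:", group_load(rebalanced))  # [900, 865, 865]
```

The first placement leaves one set of spindles doing nearly all the work, which is exactly the pattern of a couple of disks sitting at 100% utilization while the rest idle.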


March 27th, 2014 06:00

Hi,

Thanks for your kind reply.

When I looked closely at the two busy time slots, I realised each disk behind this LUN was processing roughly 70 IOPS. I may have misinterpreted the data, though. By the way, I'm just wondering: is it possible for a couple of disks to be spinning like crazy while the other disks in the same pool show much lower response times?


March 27th, 2014 06:00

Your VNX is running either OE 31 or 32; in-tier FAST requires OE 32 and a FAST license.


March 27th, 2014 06:00

Thanks. I'm pretty sure my VNX is running OE 31, so I guess an upgrade is inevitable.

I guess I should also consider rearranging the virtual machines when I get the new tray (25 x 2.5-inch 10K SAS disks).

Thanks Dynamox.


March 27th, 2014 17:00

Hi dynamox,

I understand that LUN is constrained. The LUN I'm having trouble with at the moment is SERVERS - LUN0, which doesn't show much throughput during the day, around 700-900 IOPS.

And SQL is running on RAID-10 with 16 disks, so I'm guessing it's overshooting at the moment.

I'll double-check vmkernel.log and see what's causing the issue.
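Something like this quick scan of a copy of the log should tell me how often it's happening (the log path and the exact message strings are assumptions based on what I'm seeing in vCenter, so I'll adjust the patterns to whatever actually appears in the file):

```python
# Quick-and-dirty scan of a copied vmkernel.log for access-loss messages.
# The path and the message strings below are assumptions and may differ by
# ESXi build -- adjust the patterns to match what the log actually contains.
import re
from collections import Counter

LOG_PATH = "vmkernel.log"       # copied off the host first
PATTERNS = [
    re.compile(r"lost access to volume", re.IGNORECASE),
    re.compile(r"restored access", re.IGNORECASE),
    re.compile(r"performance has deteriorated", re.IGNORECASE),
]

hits = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        for pattern in PATTERNS:
            if pattern.search(line):
                hits[pattern.pattern] += 1
                print(line.rstrip())    # keep the timestamped line for correlation

print(dict(hits))
```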

Thanks,

John
