VNX Cache issues

April 10th, 2017 07:00

Hi there,

I regularly see VNX gen 1 systems, and to a lesser degree gen 2, showing high cache usage pushing into forced flushing, usually without high controller utilization. This is typically accompanied by high utilization at the LUN level even though the underlying disks aren't showing a high workload (note these are not thin LUNs).

So my question is: would increasing the capacity of the LUN help reduce the LUN utilization? I can't see why it would, as most of these systems are using auto-tiering, but maybe I'm missing something.

Any advice would be appreciated.

Thanks,

Ed

4.5K Posts

April 10th, 2017 13:00

On VNX1 arrays the best practice is to assign the maximum amount of memory to Write cache and whatever is left to Read cache. Also, the metric called % Dirty Pages is a good indicator that the amount of Writes coming into the array is exceeding the ability of the underlying disks to handle the workload.

You will need to determine which LUNs are causing this: look for the times when SP Dirty Pages hits 99%, then see which LUNs have the most Write IOPS at those times, then look at the number of IO/sec (IOPS) the disks behind them are getting. If the disks are overloaded, you'll need more disks for the LUN. If the LUN is in a Raid Group on VNX1 you should still be able to expand the Raid Group - if you have a 4+1, you should be able to add 4 more of the same disks to the RG to make it an 8+1 (or you could use metaLUNs).
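For what it's worth, here is a rough Python sketch of that correlation step, assuming you've exported the NAR data from Analyzer to CSV. The file name and the column names ("Poll Time", "Object Name", "SP Dirty Pages (%)", "Write Throughput (IO/s)") are just placeholders for whatever your export actually uses:

```python
# Rough sketch: correlate SP dirty-page spikes with per-LUN write IOPS.
# Assumes an Analyzer CSV export with one row per object per poll interval;
# the column names below are placeholders - adjust to match your export.
import csv
from collections import defaultdict

DIRTY_PAGE_LIMIT = 99.0  # forced-flushing territory

def load_rows(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def busy_intervals(rows):
    """Poll times where either SP reported >= 99% dirty pages."""
    times = set()
    for r in rows:
        if r["Object Name"].startswith("SP") and float(r["SP Dirty Pages (%)"] or 0) >= DIRTY_PAGE_LIMIT:
            times.add(r["Poll Time"])
    return times

def top_writers(rows, times, count=10):
    """LUNs with the most write IOPS during the busy intervals."""
    writes = defaultdict(float)
    for r in rows:
        if r["Poll Time"] in times and r["Object Name"].startswith("LUN"):
            writes[r["Object Name"]] += float(r["Write Throughput (IO/s)"] or 0)
    return sorted(writes.items(), key=lambda kv: kv[1], reverse=True)[:count]

rows = load_rows("analyzer_export.csv")
times = busy_intervals(rows)
for lun, iops in top_writers(rows, times):
    print(f"{lun}: {iops:.0f} write IOPS during forced-flush intervals")
```

Once you know which LUNs dominate the writes during those intervals, check the disks behind them as described above.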

VNX2 does not have the concept of Dirty Pages, as System cache is dynamic and adjusts to the workload - sometimes you have more Write cache and sometimes more Read cache.

See KB 473729 for more information.

glen

214 Posts

April 10th, 2017 14:00

Thanks for the reply. I've checked the SP collect and the cache is set correctly; everything sits in a single FAST VP pool and all the LUNs are set to auto. Mitrend shows loads of cache dirty pages, which track the forced flushing. What's confusing me is that, with all the LUNs in an auto-tiering pool, a single RAID group in the pool showing high utilisation compared to the others just looks odd. Unless the relatively short sample means it was an isolated peak.

Actually, thinking about it, that might be the problem. I'll go get some more NAR files...
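Once I have a few more NAR samples, something like this rough Python sketch should show whether that private raid group is genuinely hot or whether I just caught an isolated peak. (Assumes CSV exports from Analyzer; the column names, file pattern and "RAID Group 12" object name are placeholders.)

```python
# Rough check across several Analyzer CSV exports: is the high utilisation
# on that private raid group sustained, or an isolated peak?
# Column names, file pattern and RG name are placeholders - adjust to your export.
import csv, glob, statistics

def rg_utilisation(pattern, rg_name):
    """Collect utilisation samples for one object across all matching exports."""
    samples = []
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["Object Name"] == rg_name:
                    samples.append(float(row["Utilization (%)"] or 0))
    return samples

samples = rg_utilisation("nar_export_*.csv", "RAID Group 12")
if samples:
    p95 = sorted(samples)[int(0.95 * (len(samples) - 1))]
    print(f"mean {statistics.mean(samples):.1f}%  p95 {p95:.1f}%  "
          f"max {max(samples):.1f}%  over {len(samples)} polls")
```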

4.5K Posts

April 11th, 2017 08:00

What model array is this and what Flare version is it running? In Flare 32 for VNX1, Auto-tiering was enhanced with in-tier re-balancing, which allows the array to re-balance slices across all the disks within a particular tier. That should help even out the workload on the private raid groups within the Pool.

There are still cases where a single raid group gets higher IOPS than the other RGs in the same tier, and that can overload its drives, push the Write cache to 99% Dirty Pages and trigger Force Flushing. This can happen when the IO hits a single RG for short periods: the slices are marked HOT at that point, but over time they age and grow less HOT, so they may not re-balance when Auto-Tiering runs at night.
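As a back-of-envelope check on whether a private raid group's disks are overloaded, you can convert the front-end read/write IOPS into back-end disk IOPS using the RAID write penalty and compare that against a rough per-disk figure. A minimal sketch - the per-disk IOPS numbers and RAID penalties below are the usual rule-of-thumb values, not official ratings:

```python
# Back-of-envelope disk load check for a private raid group.
# Rule-of-thumb per-disk IOPS figures and RAID write penalties - not official ratings.
DISK_IOPS = {"15k_sas": 180, "10k_sas": 140, "7k2_nlsas": 90, "flash": 3500}
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def backend_iops(read_iops, write_iops, raid_type):
    """Front-end IOPS translated to back-end disk IOPS."""
    return read_iops + write_iops * WRITE_PENALTY[raid_type]

def utilisation(read_iops, write_iops, raid_type, disk_type, disk_count):
    """Estimated load on the RG's disks as a fraction of their rough capability."""
    capacity = DISK_IOPS[disk_type] * disk_count
    return backend_iops(read_iops, write_iops, raid_type) / capacity

# Example: a 4+1 RAID 5 group of 15k SAS taking 400 reads + 300 writes per second
load = utilisation(400, 300, "raid5", "15k_sas", 5)
print(f"Estimated disk utilisation: {load:.0%}")  # well over 100% means the RG can't keep up
```

Anything well over 100% for sustained periods points at that RG as the source of the dirty-page build-up.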

You could try running Auto-tiering for 23 hours and 45 minutes at the LOW setting and see if that helps spread the workload across more of the private Raid Groups within the Pool's tiers.

glen
