Roger_Wu
4 Operator
•
4K Posts
0
December 29th, 2013 21:00
Usually we use the following default thresholds:
So an alert for 95% dirty pages is OK.
Ed La
1 Rookie
•
77 Posts
0
December 29th, 2013 23:00
Thanks for the reply.
But I don't quite understand the above table:
1) Disk response time > 15 ms if total IOPS > 20
2) LUN response time > 22 ms if total IOPS > 20
Does it mean the Disk and LUN response times are abnormal when they exceed 15 ms and 22 ms respectively while total IOPS > 20?
Roger_Wu
4 Operator
•
4K Posts
0
December 29th, 2013 23:00
You can refer to EMC KB emc140166 "Understanding response time for a CLARiiON CX300 array connected to server running an Oracle database" for some information. The table itself comes from an internal source and is just for our reference when working on performance cases, but some customers have a similar table too:
What metrics can I use to determine when my CLARiiON is at the maximum performance level?
Roger_Wu
4 Operator
•
4K Posts
0
December 29th, 2013 23:00
Yes, then we need to figure out why it exceeded the thresholds. But sometimes it is acceptable, e.g. during a file copy.
Please mark my answer as correct/helpful answer if it helps.
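As an illustration, the thresholds above can be checked mechanically against a CSV export of the NAR data. The field names below are assumptions for the sketch, not Analyzer's actual column names; map them to whatever your own export provides.

```python
# Hypothetical sketch: flag Analyzer samples that breach the rule-of-thumb
# thresholds (disk RT > 15 ms, LUN RT > 22 ms), counted only when the
# object is doing real work (total IOPS > 20). Field names are assumptions.

THRESHOLDS = {"Disk": 15.0, "LUN": 22.0}   # response time limits in ms
MIN_IOPS = 20.0                            # ignore near-idle samples

def flag_breaches(rows):
    """rows: dicts with 'type', 'name', 'rt_ms' and 'iops' keys."""
    breaches = []
    for row in rows:
        limit = THRESHOLDS.get(row["type"])
        if limit is None:
            continue                        # not a Disk or LUN sample
        if float(row["iops"]) > MIN_IOPS and float(row["rt_ms"]) > limit:
            breaches.append((row["name"], float(row["rt_ms"])))
    return breaches

samples = [
    {"type": "Disk", "name": "0_0_5",  "rt_ms": 18.2, "iops": 120},
    {"type": "Disk", "name": "0_0_6",  "rt_ms": 30.0, "iops": 5},   # idle
    {"type": "LUN",  "name": "LUN 12", "rt_ms": 25.1, "iops": 450},
]
print(flag_breaches(samples))   # [('0_0_5', 18.2), ('LUN 12', 25.1)]
```

The near-idle disk is skipped even though its response time is high, which matches the "if total IOPS > 20" qualifier in the table.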
Ed La
1 Rookie
•
77 Posts
0
December 29th, 2013 23:00
Do you get this table from some EMC documents?
Roger_Wu
4 Operator
•
4K Posts
1
December 30th, 2013 00:00
To answer this question, let's look into the whitepaper first:
http://www.emc.com/collateral/analyst-reports/h12090-emc-vnx-mcx.pdf
FLARE has a condition called Forced Flushing. It occurs when the percent count of dirty cache pages crosses over the high watermark and reaches 100%. At that point, the cache starts forcefully flushing unsaved (dirty) data to disk, suspending all host IO. Forced flushing continues until the percent count of dirty pages recedes below the low watermark.
Forced flushing affects the entire array and all workloads served by the array. It significantly increases the host response time until the number of cache dirty pages falls below the low watermark. The Storage Processor gives priority to writing dirty data to disk, rather than allocating new pages for incoming Host IO. The idea of high and low watermark functionality was implemented as a mechanism to avoid forced flushing. The lower the high watermark, the larger the reserved buffer in the cache, and the smaller chance that forced flushing will occur.
So why can "the SP Dirty Page% occasionally reach 95%"? Because there are too many inbound IOs and the backend disks may be overloaded, so the cache doesn't get enough time to flush the data to disk.
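The watermark behavior described above can be sketched as a toy model. This is an illustration only, not EMC code; the write and flush rates are made-up units per tick.

```python
# Toy model of write-cache watermarks and forced flushing (illustration only).

def simulate(writes_per_tick, flush_per_tick, high_wm=80, low_wm=60,
             ticks=50):
    """Return the ticks at which forced flushing was triggered."""
    dirty = 0.0          # dirty pages as a percentage of write cache
    forced = False       # inside a forced-flush cycle?
    events = []
    for t in range(ticks):
        if forced:
            # Host IO is suspended; the SP only flushes.
            dirty = max(dirty - flush_per_tick, 0.0)
            if dirty < low_wm:
                forced = False          # resume normal operation
            continue
        dirty += writes_per_tick
        if dirty >= 100:
            dirty = 100.0
            forced = True               # cache is full: forced flushing
            events.append(t)
            continue
        # Above the high watermark the SP flushes more aggressively.
        rate = flush_per_tick * (2 if dirty > high_wm else 1)
        dirty = max(dirty - rate, 0.0)
    return events

print(simulate(writes_per_tick=5, flush_per_tick=5))    # [] - disks keep up
print(simulate(writes_per_tick=12, flush_per_tick=5))   # forced flushes occur
```

When the flush rate keeps up with inbound writes, dirty pages never approach 100%; when writes outrun the disks for long enough, forced flushing is the inevitable result, which is exactly the situation quoted from the whitepaper.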
Ed La
1 Rookie
•
77 Posts
0
December 30th, 2013 00:00
Sorry, this could be a stupid question, but I can't find the Forced Flush count in the NAR file.
Ed La
1 Rookie
•
77 Posts
0
December 30th, 2013 00:00
Thanks, but why is it that when my SP high watermark is set to 80%, the SP Dirty Page% can occasionally reach 91% to 95% for about 10 minutes before the data is flushed to the backend disks?
Does it mean the disk spindles at that particular time were not capable of handling the workload generated by the RHEL VM TAR job to the SAN VM disk?
Roger_Wu
4 Operator
•
4K Posts
0
December 30th, 2013 01:00
Some metrics are only available in advanced mode:
You can refer to its online help manual for the details:
dynamox
9 Legend
•
20.4K Posts
0
December 30th, 2013 08:00
Ed La
1 Rookie
•
77 Posts
0
January 5th, 2014 23:00
Hi, thanks.
I am planning to lower the high and low watermarks from the current 60/80% to 50/70% to avoid the SP cache Dirty Page% occasionally rising above 90%.
However, from my Analyzer output, I found that my FAST Cache write hit ratio stays at around 0.250 (25%) while the Dirty Page% fluctuates normally between 60% and 80%; it's only when the FAST Cache write hit ratio drops below 0.125 (12.5%) for a longer period, about 30 minutes, that the Dirty Page% rises to above 90% to 95%.
Do you think it's safe to lower the watermarks to 50/70%? I am worried that lowering the watermarks might have a bad impact on overall system performance.
Roger_Wu
4 Operator
•
4K Posts
0
January 6th, 2014 00:00
Below the low watermark, only a few writes are flushed per cycle; above the high watermark, more writes are flushed per cycle. So usually "the lower the high watermark, the larger the reserved buffer in the cache, and the smaller the chance that forced flushing will occur." But I can imagine that it's quite difficult to find a balance point. There should be no impact from adjusting the high/low watermarks, so you can give it a try.
Storagesavvy
474 Posts
1
January 6th, 2014 09:00
Couple of things...
1.) The FASTCache hit ratio is not the same as the SP cache hit ratio. So you could have a 25% FASTCache hit ratio and a 99% SP cache hit ratio at the same time, or any other combination for that matter.
2.) As far as getting the actual SP cache hit ratio, it is definitely tracked on the LUN, not at the SP. However, pools make it more difficult because caching is handled on the private LUNs that make up the pool, not the user LUNs inside the pool, so many of the statistics for pool LUNs are unavailable.
3.) As already mentioned, lowering from 60/80 to 50/70 will likely make very little difference in overall performance and is safe to try. I have a situation with a customer where the watermarks are 15/25 because of a horrendous workload. In their case it helped, but I would not advise going that low.
4.) Cache hit ratios and watermarks are proportional, i.e. the lower the watermarks, the lower the cache hit ratio. How much lower depends entirely on the workload. Since you also have FASTCache, it may make little to no difference.
You note that the FASTCache hit ratio drops at about the same time as the dirty pages spike. This indicates to me that there was a shift in the workload characteristics during that time. You said it was a RHEL tar dump, which makes sense. The server is writing a large amount of data in a short time, filling up the cache as it does. Then the array flushes the data in the background. Looking at your chart of Dirty Pages % over time, I see only one spike in many hours. If it were my system, I'd leave the array alone and see if there's any way to tune the job to run a bit slower, or just ignore it for now. If you are not seeing application response time problems, and all you see is the dirty pages spiking once a day or so, it's likely not an issue you should be spending time trying to solve.
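A back-of-the-envelope model shows why a single large job can pin dirty pages high for a stretch and then let them drain on their own. Every rate below is an assumption for illustration, not a measurement from this array; only the ~12.6 GB burst size and ~700 MB write cache come from the thread.

```python
# Back-of-the-envelope model of the tar burst (all rates are assumptions).

write_cache_mb = 700      # roughly the write cache available per the thread
burst_mb = 12600          # the ~12.6 GB tar.gz written by the RHEL VM
ingest_mb_s = 60.0        # assumed host write rate during the tar job
flush_mb_s = 40.0         # assumed sustained backend flush rate

# While the burst runs, dirty pages grow at (ingest - flush).
fill_rate = ingest_mb_s - flush_mb_s
time_to_fill = write_cache_mb / fill_rate      # seconds until cache is full
burst_duration = burst_mb / ingest_mb_s        # seconds the burst lasts

if time_to_fill < burst_duration:
    print("Cache saturates after ~%.0fs; writes then run at disk speed "
          "for the rest of the ~%.0fs burst." % (time_to_fill, burst_duration))

# Once the burst ends, the backlog drains at the backend flush rate.
drain_time = write_cache_mb / flush_mb_s
print("Backlog drains in ~%.1fs after the burst ends." % drain_time)
```

With these assumed rates the cache saturates long before the burst ends, dirty pages stay pinned near the top for the remainder of the job, and the backlog then drains quickly once the job finishes, which matches the shape of a once-a-day spike.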
Ed La
1 Rookie
•
77 Posts
0
January 6th, 2014 18:00
Thanks for the detailed explanation.
Yes, you are right. The tar-gzipped file is about 12.6 GB, and I can see that the TAR job initially had more reads than writes because it was reading the directories and files; eventually it generated more and more writes to the tar-gzipped file, and I guess that caused the spike in the Dirty Page%.
I would like to know whether there is any way to shrink the so-called System Buffer to a smaller portion of the 4096 MB SP memory. Right now it occupies 3396 MB, which leaves only 727 MB for read and write cache. I am thinking that if I could double the size of the read and write cache, it would give more room to absorb writes and subsequently lower the Dirty Page%.
Ed La
1 Rookie
•
77 Posts
0
January 6th, 2014 23:00
If I decrease the watermarks significantly, what should I observe in the NAR file to tell whether performance has become better or worse?
Is it by checking the LUN response time and also making sure LUN utilization doesn't become too high?
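Those two metrics are a reasonable basis for a before/after comparison. As a sketch (the sample layout below is an assumption, not Analyzer's actual export format), you could summarize a pre-change and post-change interval like this:

```python
# Hypothetical sketch: compare LUN metrics from two Analyzer intervals,
# one before and one after lowering the watermarks. The (response time,
# utilization) tuple layout is an assumption for illustration.

def summarize(interval):
    """interval: list of (response_time_ms, utilization_pct) samples."""
    n = len(interval)
    avg_rt = sum(rt for rt, _ in interval) / n
    busy = sum(1 for _, u in interval if u > 70.0)  # rule-of-thumb ceiling
    return {"avg_rt_ms": avg_rt, "samples_over_70pct_util": busy}

before = [(12.0, 55.0), (30.0, 85.0), (14.0, 60.0)]   # with dirty-page spikes
after  = [(11.0, 52.0), (16.0, 68.0), (13.0, 58.0)]   # after lowering to 50/70

b, a = summarize(before), summarize(after)
print("avg LUN RT: %.1f -> %.1f ms" % (b["avg_rt_ms"], a["avg_rt_ms"]))
print("samples over 70%% utilization: %d -> %d"
      % (b["samples_over_70pct_util"], a["samples_over_70pct_util"]))
```

If the average LUN response time drops and fewer samples sit at high utilization after the change, the lower watermarks helped; if response time rises during steady-state periods, the smaller effective write buffer is hurting and the change should be reverted.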