Roger_Wu
4 Operator
•
4K Posts
0
December 29th, 2013 21:00
Usually we use the following default thresholds:
So an alert for 95% dirty pages is OK.
Ed La
1 Rookie
•
77 Posts
0
December 29th, 2013 23:00
Thanks for the reply.
But I don't quite understand the above table:
1) Disk response time > 15 ms if total IOPS > 20
2) LUN response time > 22 ms if total IOPS > 20
Does it mean the Disk and LUN response times are abnormal when they exceed 15 ms and 22 ms respectively while total IOPS > 20?
Roger_Wu
4 Operator
•
4K Posts
0
December 29th, 2013 23:00
You can refer to EMC KB emc140166 "Understanding response time for a CLARiiON CX300 array connected to server running an Oracle database" for some information. The table itself comes from an internal source and is just for our reference when working on performance cases, but some customers have a similar table too:
What metrics can I use to determine when my CLARiiON is at the maximum performance level?
Roger_Wu
4 Operator
•
4K Posts
0
December 29th, 2013 23:00
Yes, then we need to figure out why it exceeded the thresholds. But sometimes it is acceptable, e.g. during a file copy.
Please mark my answer as correct/helpful answer if it helps.
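As an illustration, the thresholds above can be checked mechanically against a CSV export of the NAR data. The field names below are assumptions for the sketch, not Analyzer's actual column names; map them to whatever your own export provides.

```python
# Hypothetical sketch: flag Analyzer samples that breach the rule-of-thumb
# thresholds (disk RT > 15 ms, LUN RT > 22 ms), counted only when the
# object is doing real work (total IOPS > 20). Field names are assumptions.

THRESHOLDS = {"Disk": 15.0, "LUN": 22.0}   # response time limits in ms
MIN_IOPS = 20.0                            # ignore near-idle samples

def flag_breaches(rows):
    """rows: dicts with 'type', 'name', 'rt_ms' and 'iops' keys."""
    breaches = []
    for row in rows:
        limit = THRESHOLDS.get(row["type"])
        if limit is None:
            continue                        # not a Disk or LUN sample
        if float(row["iops"]) > MIN_IOPS and float(row["rt_ms"]) > limit:
            breaches.append((row["name"], float(row["rt_ms"])))
    return breaches

samples = [
    {"type": "Disk", "name": "0_0_5",  "rt_ms": 18.2, "iops": 120},
    {"type": "Disk", "name": "0_0_6",  "rt_ms": 30.0, "iops": 5},   # idle
    {"type": "LUN",  "name": "LUN 12", "rt_ms": 25.1, "iops": 450},
]
print(flag_breaches(samples))   # [('0_0_5', 18.2), ('LUN 12', 25.1)]
```

The near-idle disk is skipped even though its response time is high, which matches the "if total IOPS > 20" qualifier in the table.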
Ed La
1 Rookie
•
77 Posts
0
December 29th, 2013 23:00
Do you get this table from some EMC documents?
Roger_Wu
4 Operator
•
4K Posts
1
December 30th, 2013 00:00
To answer this question, let's look into the whitepaper first:
http://www.emc.com/collateral/analyst-reports/h12090-emc-vnx-mcx.pdf
FLARE has a condition called Forced Flushing. It occurs when the percent count of dirty cache pages crosses over the high watermark and reaches 100%. At that point, the cache starts forcefully flushing unsaved (dirty) data to disk, suspending all host IO. Forced flushing continues until the percent count of dirty pages recedes below the low watermark.
Forced flushing affects the entire array and all workloads served by the array. It significantly increases the host response time until the number of cache dirty pages falls below the low watermark. The Storage Processor gives priority to writing dirty data to disk, rather than allocating new pages for incoming Host IO. The idea of high and low watermark functionality was implemented as a mechanism to avoid forced flushing. The lower the high watermark, the larger the reserved buffer in the cache, and the smaller chance that forced flushing will occur.
So why can "the SP Dirty Page% occasionally reach 95%"? Because there are too many inbound IOs and the backend disks may be overloaded, so the cache doesn't get enough time to flush the data to disk.
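The watermark behavior described above can be sketched as a toy model. This is an illustration only, not EMC code; the write and flush rates are made-up units per tick.

```python
# Toy model of write-cache watermarks and forced flushing (illustration only).

def simulate(writes_per_tick, flush_per_tick, high_wm=80, low_wm=60,
             ticks=50):
    """Return the ticks at which forced flushing was triggered."""
    dirty = 0.0          # dirty pages as a percentage of write cache
    forced = False       # inside a forced-flush cycle?
    events = []
    for t in range(ticks):
        if forced:
            # Host IO is suspended; the SP only flushes.
            dirty = max(dirty - flush_per_tick, 0.0)
            if dirty < low_wm:
                forced = False          # resume normal operation
            continue
        dirty += writes_per_tick
        if dirty >= 100:
            dirty = 100.0
            forced = True               # cache is full: forced flushing
            events.append(t)
            continue
        # Above the high watermark the SP flushes more aggressively.
        rate = flush_per_tick * (2 if dirty > high_wm else 1)
        dirty = max(dirty - rate, 0.0)
    return events

print(simulate(writes_per_tick=5, flush_per_tick=5))    # [] - disks keep up
print(simulate(writes_per_tick=12, flush_per_tick=5))   # forced flushes occur
```

When the flush rate keeps up with inbound writes, dirty pages never approach 100%; when writes outrun the disks for long enough, forced flushing is the inevitable result, which is exactly the situation quoted from the whitepaper.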
Ed La
1 Rookie
•
77 Posts
0
December 30th, 2013 00:00
Sorry, this could be a stupid question, but I can't find the Forced Flush count in the NAR file.
Ed La
1 Rookie
•
77 Posts
0
December 30th, 2013 00:00
Thanks, but why is it that when my SP high watermark is set to 80%, the SP Dirty Page% can occasionally reach 91% to 95% for about 10 minutes before the data is flushed to the backend disks?
Does it mean the disk spindles at that particular time were not capable of handling the workload generated by the RHEL VM TAR job to the SAN VM disk?
Roger_Wu
4 Operator
•
4K Posts
0
December 30th, 2013 01:00
Some metrics are only available in advanced mode:
You can refer to its online help manual for the details:
dynamox
9 Legend
•
20.4K Posts
0
December 30th, 2013 08:00
Ed La
1 Rookie
•
77 Posts
0
January 5th, 2014 23:00
Hi, thanks.
I am planning to lower the high and low watermarks from the current 60/80% to 50/70% to avoid the SP cache Dirty Page% occasionally rising above 90%.
However, from my Analyzer output, I found that my FAST Cache write hit ratio stays at around 0.250 (25%) while the Dirty Page% fluctuates normally between 60% and 80%; it's only when the FAST Cache write hit ratio drops below 0.125 (12.5%) for a longer period, about 30 minutes, that the Dirty Page% rises to above 90% to 95%.
Do you think it's safe to lower the watermarks to 50/70%? I am worried that lowering the watermarks might have a bad impact on overall system performance.
Roger_Wu
4 Operator
•
4K Posts
0
January 6th, 2014 00:00
Below the low watermark, only a few writes are flushed per cycle; above the high watermark, more writes are flushed per cycle. So usually "the lower the high watermark, the larger the reserved buffer in the cache, and the smaller the chance that forced flushing will occur." But I can imagine that it's quite difficult to find a balance point. There should be no impact from adjusting the high/low watermarks, so you can give it a try.
Storagesavvy
474 Posts
1
January 6th, 2014 09:00
Couple of things...
1.) The FASTCache hit ratio is not the same as the SP cache hit ratio. So you could have a 25% FASTCache hit ratio and a 99% SP cache hit ratio at the same time, or any other combination for that matter.
2.) As far as getting the actual SP cache hit ratio, it is definitely tracked on the LUN, not at the SP. However, pools make it more difficult because caching is handled on the private LUNs that make up the pool, not the user LUNs inside the pool, so many of the statistics for pool LUNs are unavailable.
3.) As already mentioned, lowering from 60/80 to 50/70 will likely make very little difference in overall performance and is safe to try. I have a situation with a customer where the watermarks are 15/25 because of a horrendous workload. In their case it helped, but I would not advise going that low.
4.) Cache hit ratios and watermarks are proportional, i.e. the lower the watermarks, the lower the cache hit ratio. How much lower depends entirely on the workload. Since you also have FASTCache, it may make little to no difference.
You note that the FASTCache hit ratio drops at about the same time as the dirty pages spike. This indicates to me that there was a shift in the workload characteristics during that time. You said it was a RHEL tar dump, which makes sense. The server is writing a large amount of data in a short time, filling up the cache as it does. Then the array flushes the data in the background. Looking at your chart of Dirty Pages % over time, I see only one spike in many hours. If it were my system, I'd leave the array alone and see if there's any way to tune the job to run a bit slower, or just ignore it for now. If you are not seeing application response time problems, and all you see is the dirty pages spiking once a day or so, it's likely not an issue you should be spending time trying to solve.
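A back-of-the-envelope model shows why a single large job can pin dirty pages high for a stretch and then let them drain on their own. Every rate below is an assumption for illustration, not a measurement from this array; only the ~12.6 GB burst size and ~700 MB write cache come from the thread.

```python
# Back-of-the-envelope model of the tar burst (all rates are assumptions).

write_cache_mb = 700      # roughly the write cache available per the thread
burst_mb = 12600          # the ~12.6 GB tar.gz written by the RHEL VM
ingest_mb_s = 60.0        # assumed host write rate during the tar job
flush_mb_s = 40.0         # assumed sustained backend flush rate

# While the burst runs, dirty pages grow at (ingest - flush).
fill_rate = ingest_mb_s - flush_mb_s
time_to_fill = write_cache_mb / fill_rate      # seconds until cache is full
burst_duration = burst_mb / ingest_mb_s        # seconds the burst lasts

if time_to_fill < burst_duration:
    print("Cache saturates after ~%.0fs; writes then run at disk speed "
          "for the rest of the ~%.0fs burst." % (time_to_fill, burst_duration))

# Once the burst ends, the backlog drains at the backend flush rate.
drain_time = write_cache_mb / flush_mb_s
print("Backlog drains in ~%.1fs after the burst ends." % drain_time)
```

With these assumed rates the cache saturates long before the burst ends, dirty pages stay pinned near the top for the remainder of the job, and the backlog then drains quickly once the job finishes, which matches the shape of a once-a-day spike.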
Ed La
1 Rookie
•
77 Posts
0
January 6th, 2014 18:00
Thanks for the detailed explanation.
Yes, you are right. The tar-gzipped file is about 12.6 GB, and I can see that the TAR job initially had more reads than writes because it was reading the directories and files; eventually it generated more and more writes to the tar-gzipped file, and I guess that caused the spike in the Dirty Page%.
I would like to know whether there is any way to shrink the so-called System Buffer to a smaller portion of the 4096 MB SP memory. Right now it occupies 3396 MB, which leaves only 727 MB for read and write cache. I am thinking that if I could double the size of the read and write cache, it would give more room to absorb writes and subsequently lower the Dirty Page%.
Ed La
1 Rookie
•
77 Posts
0
January 6th, 2014 23:00
If I decrease the watermarks significantly, what should I observe in the NAR file to tell whether performance has become better or worse?
Is it by checking the LUN response time and also making sure LUN utilization doesn't become too high?
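Those two metrics are a reasonable basis for a before/after comparison. As a sketch (the sample layout below is an assumption, not Analyzer's actual export format), you could summarize a pre-change and post-change interval like this:

```python
# Hypothetical sketch: compare LUN metrics from two Analyzer intervals,
# one before and one after lowering the watermarks. The (response time,
# utilization) tuple layout is an assumption for illustration.

def summarize(interval):
    """interval: list of (response_time_ms, utilization_pct) samples."""
    n = len(interval)
    avg_rt = sum(rt for rt, _ in interval) / n
    busy = sum(1 for _, u in interval if u > 70.0)  # rule-of-thumb ceiling
    return {"avg_rt_ms": avg_rt, "samples_over_70pct_util": busy}

before = [(12.0, 55.0), (30.0, 85.0), (14.0, 60.0)]   # with dirty-page spikes
after  = [(11.0, 52.0), (16.0, 68.0), (13.0, 58.0)]   # after lowering to 50/70

b, a = summarize(before), summarize(after)
print("avg LUN RT: %.1f -> %.1f ms" % (b["avg_rt_ms"], a["avg_rt_ms"]))
print("samples over 70%% utilization: %d -> %d"
      % (b["samples_over_70pct_util"], a["samples_over_70pct_util"]))
```

If the average LUN response time drops and fewer samples sit at high utilization after the change, the lower watermarks helped; if response time rises during steady-state periods, the smaller effective write buffer is hurting and the change should be reverted.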