Unsolved

November 16th, 2010 12:00

Dirty Pages Question

What is the best way to determine which LUN or RAID group is causing our write cache to fill up? We are constantly hitting 99% dirty pages at different points during the day, and I am trying to narrow down the primary suspect.

542 Posts

November 16th, 2010 12:00

If you have Analyzer installed, you can check each LUN to see what its write cache hit ratio is. There are a number of data points about write cache, but you will need to enable the advanced settings in Analyzer to see them.
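For what it's worth, if you want to compute the ratio yourself from raw counters rather than reading Analyzer's value, a hit ratio is just hits divided by total operations. A minimal sketch, assuming hits-per-second and misses-per-second counters (the parameter names are placeholders, not Analyzer's exact field names):

```python
# Generic hit-ratio arithmetic: hits / (hits + misses) over a polling interval.
# Counter names are placeholders; map them to whatever your Analyzer export provides.
def write_cache_hit_ratio(hits_per_s: float, misses_per_s: float) -> float:
    total = hits_per_s + misses_per_s
    return hits_per_s / total if total else 1.0  # an idle interval counts as "all hits"

print(f"{write_cache_hit_ratio(hits_per_s=950.0, misses_per_s=50.0):.0%}")  # prints 95%
```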

What are your watermarks set at? Usually they are set at 60/80, so at 80% the array should be force flushing and never getting to 99%.

8 Posts

November 16th, 2010 13:00

Ours was lowered from the default to 40/50. We have about 180 or so LUNs. What would be considered a poor write cache hit ratio? Just glancing at the NAR file I uploaded, most are at 100%, with a few that dip pretty low sometimes, but nothing sustained.

8 Posts

November 16th, 2010 13:00

Yeah, I have been looking over some of the advanced stuff today. I liked the forced flush metrics as well as the write cache misses, etc. The problem is that none of these tell me what exactly is filling the cache, only the LUNs that seem to be most impacted when it becomes full, if that makes sense.

131 Posts

November 16th, 2010 13:00

Open a .nar file. Open Performance Detail. Right-click over the array in Analyzer -> Select All LUNs.

Look at write cache hits/s and forced flushes/s for the LUNs - you'll probably need to turn on advanced settings in Navisphere Analyzer to see those values.

You'll get a bunch of lines strung together, but you'll also notice some that tower above the others at the times when write cache is at 99%. Remove all LUNs from your Analyzer view and select just those that were towering above the rest. Those LUNs are probably your best bet.

Also, write throughput/s and write bandwidth/s might help.
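If you would rather script that than eyeball the graph, here is a rough sketch. It assumes the NAR has been dumped to CSV first (for example with naviseccli analyzer -archivedump, or an export from the Analyzer GUI), that the CSV has an "Object Name" column, and that the SP-level "% Dirty Pages" value has been joined onto each LUN row for the same poll interval; the column names are guesses, so adjust them to your actual export headers.

```python
# Rank LUNs by forced-flush activity during the intervals when the cache is nearly full.
import csv
from collections import defaultdict

DIRTY_THRESHOLD = 95.0       # only count intervals where dirty pages are pegged
totals = defaultdict(float)  # LUN name -> forced flushes summed over those intervals

with open("array.csv", newline="") as f:
    for row in csv.DictReader(f):
        if not row["Object Name"].startswith("LUN"):
            continue
        if float(row.get("% Dirty Pages", 0) or 0) < DIRTY_THRESHOLD:
            continue
        totals[row["Object Name"]] += float(row.get("Forced Flushes/s", 0) or 0)

# The handful of LUNs at the top of this list are the ones "towering" over the rest.
for lun, ff in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{lun}: {ff:.0f} forced flushes/s summed over saturated intervals")
```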

131 Posts

November 16th, 2010 14:00

Write cache hits/s, not misses.  You want to see what's using it all.  If you see something towering over everything else, then that's probably the one causing it.  Once you narrow it down to a few LUNs, then drill down and look further.  See what the disks are doing.  Or you could disable write cache for a LUN you suspect and see if your write cache values get better.

It's usually incredibly obvious when a LUN is eating all of your write cache.  The ones writing/flushing the most data are usually a good place to look.  If you look at every LUN on the same graph for these metrics, then you'll probably see one or two LUNs with much higher rates than the others.

Then once you find the LUN, it's probably time to either put it on more disks, move it to RAID10, or yell at someone.

4.5K Posts

November 16th, 2010 14:00

If you have fewer than about 400 disks on the array, then you can select all DISKS and look at Total IOPS - pick out the ones that are the highest:

  • 15K FC disks: any with more than 180 IOPS
  • 10K FC disks: any with more than 120 IOPS
  • SATA disks: any with more than 80 IOPS

These are the ones to look at more carefully.

When Dirty Pages hits 99%, you have a real issue on the array: Write IO is arriving at LUNs faster than the disks can flush it. A good bet is disks where Total IOPS and Write IOPS are very high - over the limits above. A simple test is to disable the write cache on a busy LUN and see if Dirty Pages decreases - this will also really slow down that LUN, so be careful. We see this most often with SATA disks, and sometimes with SAN Copy from FC disks to SATA disks, LUN Migration from FC to SATA, etc.

glen
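If eyeballing a few hundred disks is impractical, the same thresholds can be applied in a short script. This is only a sketch: it assumes a per-disk export with Disk, Type, and Total IOPS columns (those names are made up), and the limits are the rough rules of thumb from the post above, not hard specifications.

```python
# Flag disks whose Total IOPS exceed the rough per-type ceilings quoted above:
# ~180 for 15K FC, ~120 for 10K FC, ~80 for SATA.
import csv

LIMITS = {"15K FC": 180, "10K FC": 120, "SATA": 80}

with open("disk_iops.csv", newline="") as f:  # assumed export: Disk,Type,Total IOPS
    for row in csv.DictReader(f):
        limit = LIMITS.get(row["Type"])
        iops = float(row["Total IOPS"])
        if limit is not None and iops > limit:
            print(f"{row['Disk']} ({row['Type']}): {iops:.0f} IOPS exceeds ~{limit}")
```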

313 Posts

November 16th, 2010 15:00

The easiest place to check for oversaturation of drives is the RAID group queue length vs. average busy queue length. If either is higher than 10 I/Os, you will likely see slower response times for those LUNs. Search in Navisphere Manager Help for good definitions of each of these statistics.
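To make that rule of thumb concrete, a trivial check (the ~10 outstanding I/Os figure comes straight from the post above; the function name is just illustrative):

```python
# Rule of thumb: a RAID group queue length or average busy queue length above ~10
# outstanding I/Os usually means slower response times for the LUNs on that group.
def raid_group_saturated(queue_length: float, avg_busy_queue_length: float,
                         limit: float = 10.0) -> bool:
    return queue_length > limit or avg_busy_queue_length > limit

print(raid_group_saturated(4.2, 6.8))    # False - comfortable
print(raid_group_saturated(3.1, 14.5))   # True  - drives likely oversaturated
```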

313 Posts

November 16th, 2010 15:00

I would add just one other significant factor to Glen's point: the saturation levels he provided hold for each disk type with the caveat that the I/O is small block (less than or equal to 8 KB). Large I/O causes more traffic through the system and can therefore cause write cache saturation as well.

Right-click on the LUN and go to the I/O size distribution to see the exact spread of request sizes.

You can also compare the write size at the LUN level vs. the disk level to see how well the system is coalescing the requests. If you see larger requests at the disk level, in most cases you should see improved bandwidth but lower levels of sustainable throughput (the exception being sequential reads).

There is plenty more to consider, but long story short, I/O size is an important factor when considering write cache utilization.
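One way to put a number on that coalescing comparison: average write size is write bandwidth divided by write throughput over the same interval, so contrasting the LUN-level figure with the disk-level figure shows how much merging the cache is doing. The numbers below are made up purely for illustration.

```python
# Average write size = write bandwidth / write throughput over the same poll interval.
# Comparing the LUN-level figure with the disk-level figure shows how much the write
# cache is coalescing small host writes into larger back-end writes.
def avg_write_size_kb(write_mb_per_s: float, writes_per_s: float) -> float:
    return (write_mb_per_s * 1024.0) / writes_per_s if writes_per_s else 0.0

lun_kb = avg_write_size_kb(write_mb_per_s=40.0, writes_per_s=5000.0)  # host-facing LUN
disk_kb = avg_write_size_kb(write_mb_per_s=42.0, writes_per_s=700.0)  # disks behind it
print(f"LUN ~{lun_kb:.0f} KB/write, disks ~{disk_kb:.0f} KB/write, "
      f"coalescing factor ~{disk_kb / lun_kb:.1f}x")
```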

8 Posts

November 17th, 2010 07:00

Thanks everyone, I'll take a look at all of your suggestions.

2 Intern • 5.7K Posts

November 17th, 2010 07:00

60/80 means the following:

  • Below the 60% low watermark, flushing is a low-priority task.
  • Between 60% and 80% dirty pages, flushing runs at normal priority.
  • Above the 80% high watermark, forced flushing starts, which is high priority; once the level drops back below 80% the priority drops slightly, and flushing goes back to low priority once it falls below 60%.

It's not a problem when write pending values go above 80% - they can sometimes even reach 99% - but once you're above 80%, forced flushing will start.

If you raise the values to 70/90, for example, the forced flushing will be over more quickly, but it will also occur more often.

The more data is in cache, the better it is for performance. When the low/high watermarks are set too low, less data will actually be held in cache and more flushing will occur, which probably will not be good for overall performance.

The default of 60/80 is the best setting in most cases.
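Just to make those bands concrete, a minimal sketch of the behavior described above, assuming the simple three-band model in this post (the labels are descriptive, not CLARiiON internals):

```python
# Minimal model of the flushing bands described above for low/high watermarks of 60/80.
def flush_priority(dirty_pct: float, low: float = 60.0, high: float = 80.0) -> str:
    if dirty_pct >= high:
        return "forced flushing (high priority)"
    if dirty_pct >= low:
        return "normal-priority flushing"
    return "low-priority (idle) flushing"

for pct in (35, 65, 82, 99):
    print(f"{pct}% dirty pages -> {flush_priority(pct)}")
```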

313 Posts

December 7th, 2010 14:00

RRR's post reminded me to say it directly: you can go to Select All LUNs > Forced Flushes to see which LUNs are busy during the periods when write cache is full. Seeing a particular LUN with a high number of forced flushes doesn't necessarily mean it is the only cause of the write cache saturation, but it helps with the isolation.

2 Intern • 1.3K Posts

December 11th, 2010 10:00

I went to the array, then Analyzer, then Performance Details, and then LUN/SP, but I am not able to see the values for the metrics I selected. What is the 'advanced settings' option in Navi Analyzer that displays those values?

4.5K Posts

December 13th, 2010 09:00

In Navisphere, go to Tool/Analyzer/Customize - on the General tab, there is an Advanced check box - check that.

Glen

2 Intern • 1.3K Posts

December 13th, 2010 12:00

Will there be a performance hit from this?

4.5K Posts

December 13th, 2010 15:00

Not for just enabling the Advanced option - it only displays data that has already been collected in the NAR. You might also want to check out the Analyzer section of the Navisphere Help - it goes into more detail about using the Advanced feature.

glen
