Write times and Write pending correlation

Question

Hello guys,

I am trying to make sense of this situation. Why is it that when I have over 20,000 write pendings (9:15am), average write times are good (less than 20 ms), yet when write pendings are very low (around 10:15am), I see almost 90 ms write times? The write pending threshold is around 55,000 per device (5% of total system write pending slots). This is a 100% write workload. Any clues?

10-19-2010 4-34-56 PM.png

Quincy561 · Answer

When the backend can't keep up with the frontend, the WP counts get high. If you are seeing low WP counts along with poor write response times, I would suspect a frontend bottleneck.
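To illustrate why a high WP count by itself does not have to mean slow writes, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not how the array is actually implemented: writes are acknowledged as soon as they land in a cache slot, the backend destages a fixed number of slots per tick, and writes only wait once the WP limit is reached. The function name, the tick granularity, and all numbers are hypothetical.

```python
def simulate_write_pendings(arrivals, destage_per_tick, wp_limit):
    """Toy model of write-pending (WP) slots over time.

    arrivals: number of writes arriving in each tick (hypothetical values)
    destage_per_tick: writes the backend can flush to disk per tick
    wp_limit: WP ceiling; writes beyond it have to wait
    Returns a list of (wp_count, waiting_writes) tuples, one per tick.
    """
    wp = 0
    backlog = 0
    history = []
    for incoming in arrivals:
        # The backend destages first, which frees cache slots.
        wp = max(0, wp - destage_per_tick)
        # Writes are accepted into cache only while slots remain under the limit.
        waiting = backlog + incoming
        accepted = min(waiting, wp_limit - wp)
        backlog = waiting - accepted
        wp += accepted
        history.append((wp, backlog))
    return history


if __name__ == "__main__":
    # WP climbs while arrivals exceed the destage rate, but nothing waits
    # because the (assumed) limit of 55000 is never reached.
    for wp, waiting in simulate_write_pendings([3000, 3000, 3000, 500, 500], 1000, 55000):
        print(wp, waiting)
```

In this model the WP count rises whenever the backend falls behind, yet no write is delayed until the limit is hit. Slow writes with a low WP count therefore have to come from somewhere the model does not cover, such as the front end.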

Boom1 · Answer

It could be possible. You are flushing your cache and you have low WP, so your backend is fine. But you have high write response times, so your frontends are not pushing data into cache or to the backend fast enough.

dynamox · Answer

"If you are seeing low WP counts along with poor write response times, I would suspect a frontend bottleneck."

I don't think I am following you. If I have very low WP counts, why would I have very high write times on the Symmetrix (as seen in that chart)?

Quincy561 · Answer

I was suggesting that the writes can't get into the system fast enough because you don't have enough FA resources.

dynamox · Answer

Quincy,

I could be looking at this all wrong, so please correct me. When I look at the chart with "Sampled average write times", this is from the FA's perspective: how long it took the FA to send the request to the system and get a response back. So if write pendings are low, there is nothing sitting in cache. Why would the FA wait so long?

Thank you

Quincy561 · Answer

I'm suggesting that you are writing to the FA faster than it can deal with the IOs, so it has to queue them. I would suggest adding FA CPUs (not just ports) and seeing if the write response time gets better.

dynamox · Answer

This device is mapped to four FAs. You tell me, but they don't seem busy to me (these are CPU stats).

dynamox · Answer

Could be, but if you look at 'sample average write time (ms)', it stays very high for almost an hour. These devices are configured for SRDF/S. What metrics can I look at in Performance Manager to see if SRDF could have anything to do with that?

Quincy561 · Answer

No, they don't look busy over that sample interval, but what about during a one or two second sample?

We have seen many cases where writes burst for just a very short period of time, then stop. During that short period of time they can overwhelm the front end causing high response times.
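A rough way to see this effect is a small single-server FIFO queue simulation in Python. This is only an illustrative sketch: the fixed 1 ms service time, the 60-second interval, and the 600 writes are made-up numbers, and a real FA is far more complex than one queue.

```python
def fifo_response_times(arrival_times_ms, service_ms):
    """Single-server FIFO queue: return the response time (ms) of each IO."""
    free_at = 0.0
    responses = []
    for t in sorted(arrival_times_ms):
        start = max(t, free_at)        # wait if the server is still busy
        free_at = start + service_ms   # server is busy until this IO completes
        responses.append(free_at - t)  # queueing delay plus service time
    return responses


def summarize(arrival_times_ms, service_ms, interval_ms):
    """Return (average utilization over the interval, mean response time in ms)."""
    responses = fifo_response_times(arrival_times_ms, service_ms)
    utilization = len(arrival_times_ms) * service_ms / interval_ms
    return utilization, sum(responses) / len(responses)


if __name__ == "__main__":
    interval_ms = 60_000
    even = [i * 100 for i in range(600)]  # one write every 100 ms
    burst = [0] * 600                     # the same 600 writes, all at once
    print("even :", summarize(even, 1, interval_ms))
    print("burst:", summarize(burst, 1, interval_ms))
```

Both workloads show the same 1% utilization when averaged over the minute, but the evenly spaced writes complete in about 1 ms each while the burst averages roughly 300 ms. A coarse sample interval hides exactly this difference.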

It could still be something else, such as a SAN problem or slot contention.

dynamox · Answer

We have a case open, but PSE kicked it back to local support, and it takes the local guys forever to set up STP and all that stuff. I am paying a boatload of money for ECC/Performance Manager, so I am trying to research this myself as well. Thanks for the suggestions so far.

Quincy561 · Answer

Then it could be slot locks or something else. One quick way to determine if RDF is causing the issue is to suspend RDF for a brief period and see if the response times improve.

I would also suggest opening a case with CS if this is really an issue for you, rather than trying to troubleshoot it here.

Quincy561 · Answer

Any host with the API/CLI installed can start and collect STP data via the stordaemon. You can set it up to collect up to 1 minute samples.

Also, you might want to run symstat -i 5 and see if it shows any bursting of IO.
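Once you have a series of short-interval write rates, spotting bursts can be done with a few lines of Python. This is only a sketch: it assumes you have already pulled the per-interval writes/sec values into a plain list by hand (it does not parse symstat output), and both the function name and the "5x the median" threshold are arbitrary choices for illustration.

```python
from statistics import median


def find_bursts(samples, factor=5.0):
    """Return (index, value) pairs for samples exceeding factor x the median.

    samples: per-interval write rates, e.g. one value per 5-second sample
             (hypothetical input, collected and cleaned up beforehand)
    factor:  how far above the median a sample must be to count as a burst
    """
    if not samples:
        return []
    baseline = median(samples)
    if baseline <= 0:
        return []
    return [(i, v) for i, v in enumerate(samples) if v > factor * baseline]


if __name__ == "__main__":
    # Made-up writes/sec values with two short spikes.
    rates = [200, 210, 190, 4000, 205, 198, 3800, 202]
    print(find_bursts(rates))
```

The median is used as the baseline rather than the mean so that the spikes themselves do not inflate the threshold.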

And again if you can suspend RDF for a short time to see if performance improves greatly, that would rule out the FA.

dynamox · Answer

We have put SRDF in adaptive copy mode. Response times have improved but are still higher than usual. We might suspend it for a little while and see what happens. In terms of IOPS, this host is doing the same thing it did a week ago (according to Performance Manager). The only thing that changed in the environment is that we started replicating from a new VMAX to the DMX3 (thin LUNs). This DMX3 is also an SRDF target for this application, which resides on a DMX4 (regular thick LUN replication).

dynamox · Answer

If the target SRDF box is having problems destaging from cache to disk, what items would indicate that?

Quincy561 · Answer

Putting RDF into adaptive copy mode should have removed the RDF impact to the writes.
