We are trying to troubleshoot a problem with a mainframe pool on SATA drives and are fairly convinced the issue is lack of cache and physical response time. What is odd is that we do not see the write pending limits being reached on the devices in question, yet comparing the writes per sec and write hits per sec statistics shows a pretty big gap (say 10% of the total). The job in question is 100% write; there are no reads going to the volume during the problematic time frame.
We do see the write pending count on the device approach about 80% of the write pending threshold and level off before dropping. We've confirmed that the physical drives seem to be the issue by moving the problematic workload from the SATA volumes to FC volumes: we still see the same volume of IO, but with faster completion times.
This seems to show pretty conclusively that we are hitting the physical drives for these writes. I'm trying to understand how that can be the case when the write pending count in WLA never seems to hit the threshold. Am I missing something?
This is a DMX-4 running 74 microcode.
Write "misses" can be logged for other reasons than WP limits being reached.
The best way to determine whether volume limits are being reached is to have someone dial into the box, look at the A1,BA page, and check whether the task 8 counter is increasing or whether a delay is being issued on the device just before the limit is reached (A7,DWMP).
Also I assume you mean 5773 code, not 74 code.
Yes, you are correct. I meant to say we are on the recent version of 73 microcode. We do have EMC engaged, so I'll forward what you suggested.
So is it safe to assume that writes - write hits = writes that are waiting on the physical drive instead of being serviced by cache? Is there anything else, in WLA or elsewhere, that we can check to confirm a delay?
% writes: 100 * (writes per sec / total ios per sec). Percentage of total write I/O operations performed by all of the Symmetrix devices.
deferred writes per sec: A deferred write is a write hit. A deferred write occurs when the I/O write operations are staged in cache and will be written to disk at a later time.
delayed dfw per sec: A delayed deferred fast write (DFW) is a write-miss. A delayed DFW occurs when the I/O write operations are delayed because the system or device write-pending limit was reached and the cache had to destage slots to the disks before the writes could be written to cache.
The odd thing is that we don't see any deferred writes or delayed writes, but we are seeing write hits that do NOT line up with writes.
If cache were handling all of our writes, I would expect write hits per sec = writes per sec, which in our case is not true.
When they don't match, I would expect to see deferred writes and/or pending tracks.
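The expectation above can be written down as a small check. This is only a sketch of the reasoning in this thread, not anything WLA does itself; the function name and its three inputs are hypothetical and stand for per-interval values of writes per sec, write hits per sec, and delayed dfw per sec exported from the tool.

```python
def classify_interval(writes_per_sec, write_hits_per_sec, delayed_dfw_per_sec):
    """Label one sample interval according to the expectation in this thread.

    All three arguments are hypothetical per-interval rates taken from the
    performance export; this is not a WLA API.
    """
    gap = writes_per_sec - write_hits_per_sec
    if gap <= 0:
        # Every write was counted as a cache hit.
        return "all writes were cache hits"
    if delayed_dfw_per_sec > 0:
        # Misses coincide with write-pending delays, as the definitions predict.
        return "misses explained by write-pending delays"
    # Misses with no delayed DFW: the case being asked about here.
    return "misses not explained by write-pending delays"


print(classify_interval(4000, 1000, 0))
```

Intervals that land in the last category are the ones worth raising with support, since the write-pending counters alone do not account for them.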
Yeah, I would expect your logic to be correct.
However, experience has taught me these Symm metrics aren't so straightforward, and numbers can end up being counted somewhere else.
1) are you seeing write misses against these devs?
2) on the FA/EF processor are you seeing "device write pending events" > 0
3) on the FA/EF processor are you seeing "system write pending events" > 0
can you please email me the SR number, my email is in my profile info, thanks
Sometimes counters are based on requests, which are accesses to cache slots. So if an IO spans a slot, it can be two requests.
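To illustrate the point about requests versus IOs, here is a rough sketch that counts how many cache slots a single IO touches. The slot size is passed in as a parameter because the right value depends on the device emulation and is an assumption here; the function name is made up for the example.

```python
def requests_for_io(offset_bytes, length_bytes, slot_bytes):
    """Count the cache-slot requests one IO would generate.

    slot_bytes is an assumed slot (track) size supplied by the caller;
    it is not read from the array.
    """
    if length_bytes <= 0 or slot_bytes <= 0:
        raise ValueError("length_bytes and slot_bytes must be positive")
    first_slot = offset_bytes // slot_bytes
    last_slot = (offset_bytes + length_bytes - 1) // slot_bytes
    return last_slot - first_slot + 1


SLOT = 64 * 1024  # assumed slot size, for illustration only
print(requests_for_io(0, 4096, SLOT))          # IO inside one slot
print(requests_for_io(63 * 1024, 4096, SLOT))  # IO crossing a slot boundary
```

If a counter is incremented per slot request, the second IO would be counted twice, so request-based and IO-based counters will not necessarily agree.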
Write misses can be attributed to other reasons than WP limits being reached. A slot being locked is one example.
If you are seeing WP counts very close to the limit, then there is a good chance the IOs could be delayed because of this. We can start issuing delays on the front end before the WP limit is reached.
The counter on the front end for WP events is a good place to look as mentioned.
No to questions 2 & 3: there are no obvious signs of WP events at the adapter or system level. The only oddness we see relating to write pending is some of the busy devices getting close to the limit and plateauing, but never hitting the write pending threshold.
Yes on write misses. An example of the discrepancy: during the period in question (about 30 minutes from 9PM on), at the System metric level we have about 4000 writes per sec with fewer than 1000 write hits per sec.
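For reference, the size of that gap can be worked out as follows. This is just arithmetic on the two System-level figures quoted above, treating "fewer than 1000" as an upper bound on hits; the helper name is hypothetical.

```python
def write_miss_gap(writes_per_sec, write_hits_per_sec):
    """Return (misses_per_sec, miss_percent) from the two quoted counters."""
    if writes_per_sec <= 0:
        return 0.0, 0.0
    # Clamp at zero in case hits are reported slightly above writes.
    misses = max(writes_per_sec - write_hits_per_sec, 0.0)
    return misses, 100.0 * misses / writes_per_sec


# About 4000 writes/s and at most 1000 write hits/s, per the figures above.
misses, pct = write_miss_gap(4000, 1000)
print(f"at least {misses} misses/s, at least {pct:.0f}% of writes")
```

With those inputs, at least 3000 writes per sec (75% or more of the total) are not being counted as hits.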
What is odd is that we don't necessarily see write misses, but we see write pendings get to about 95% of the write pending threshold.
We are waiting on a final answer from EMC engineering, but it looks like our DA paths were hitting their saturation point. For some reason WLA was unable to show that with the data it can parse, though with Symmerge the performance guru was able to show these ports being saturated.
Will update once we receive an answer as to why this is the case.
It can take a lot of SATA drives to saturate a DA. Can you upload your BTP file to the FTP area so I could take a look at it? And the bin file as well if you have it.