
June 10th, 2010 06:00

WLA writes per sec vs write hits per sec

We are trying to troubleshoot a problem relating to a mainframe pool using SATA drives and are pretty convinced the issue is lack of cache and physical response time. What is odd is that we do not see write pending limits on the devices in question being reached, but when using the writes per sec and write hits per sec statistics we can see a pretty big gap (say 10% of the total). The job in question is 100% write - there are no reads going to the volume during the problematic time frame.

We do see the write pending count on the devices approach about 80% of the write pending threshold and level off before dropping. We've confirmed that the physical drives seem to be the issue by moving the problematic volumes from the SATA drives to FC drives and still seeing the same volume of IO with faster completion times.

This seems to pretty conclusively show we are hitting the physical drives for these writes - I'm trying to understand how that can be the case when the WLA write pending count never seems to hit the threshold. Am I missing something?

This is a DMX-4 running 74 microcode.

1.3K Posts

June 10th, 2010 07:00

Write "misses" can be logged for other reasons than WP limits being reached.

The best way to determine whether volume limits are being reached is to have someone dial into the box and check the A1,BA page to see if the task 8 counter is increasing, or whether a delay is being issued on the device just before the limit is reached (A7,DWMP).

Also I assume you mean 5773 code, not 74 code.

385 Posts

June 10th, 2010 11:00

Yes, you are correct - I meant to say we are on a recent version of the 5773 microcode. We do have EMC engaged, so I'll forward what you suggested.

So is it safe to assume that writes - write hits = writes that are waiting on the physical drives instead of being serviced by cache? Is there something in WLA, or anything else we can check, that would confirm a delay?

385 Posts

June 10th, 2010 12:00

The odd thing is that we don't see any deferred writes or delayed writes, but we are seeing write hits NOT aligned with writes.

I would expect that if cache were handling all of the writes, write hits per sec would equal writes per sec, which in our case it does not.

I would expect that when they don't match, you would see deferred writes and/or pending tracks.
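
To put numbers on it, here is the sanity check I'm applying - a minimal Python sketch; the counter names and numbers are illustrative, not actual WLA fields:

```python
# If cache absorbs every write, write hits/sec should equal writes/sec,
# so any gap implies writes being serviced somewhere other than cache.
# Counter names and numbers are illustrative, not actual WLA fields.

def write_miss_gap(writes_per_sec: float, write_hits_per_sec: float):
    """Return the implied misses/sec and their share of all writes."""
    misses = writes_per_sec - write_hits_per_sec
    miss_pct = 100.0 * misses / writes_per_sec if writes_per_sec else 0.0
    return misses, miss_pct

# Roughly our situation: a gap of about 10% of total writes.
misses, pct = write_miss_gap(writes_per_sec=1000.0, write_hits_per_sec=900.0)
print(f"{misses:.0f} misses/sec ({pct:.0f}% of writes)")
```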

184 Posts

June 10th, 2010 12:00

% writes = 100 * (writes per sec / total IOs per sec)
Percentage of total write I/O operations performed by all of the Symmetrix devices.

deferred writes per sec - A deferred write is a write hit. A deferred write occurs when the I/O write operations are staged in cache and will be written to disk at a later time.

delayed dfw per sec - A delayed deferred fast write (DFW) is a write miss. A delayed DFW occurs when the I/O write operations are delayed because the system or device write-pending limit was reached and the cache had to destage slots to the disks before the writes could be written to cache.
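
Putting those definitions together, a quick sketch of how the counters should reconcile (field names and numbers are hypothetical, not actual WLA output):

```python
# Reconciling the counters per the definitions above: deferred writes are
# write hits, delayed DFWs are write misses caused by WP limits, so any
# remainder is unexplained by either. Field names and numbers are
# hypothetical, for illustration only.

def pct_writes(writes_per_sec: float, total_ios_per_sec: float) -> float:
    """% writes = 100 * (writes per sec / total IOs per sec)."""
    return 100.0 * writes_per_sec / total_ios_per_sec if total_ios_per_sec else 0.0

sample = {
    "writes_per_sec": 100.0,
    "total_ios_per_sec": 100.0,       # 100% write workload, as in this thread
    "deferred_writes_per_sec": 90.0,  # write hits
    "delayed_dfw_per_sec": 0.0,       # write misses due to WP limits
}

print(f"% writes: {pct_writes(sample['writes_per_sec'], sample['total_ios_per_sec']):.0f}%")
unexplained = (sample["writes_per_sec"]
               - sample["deferred_writes_per_sec"]
               - sample["delayed_dfw_per_sec"])
print(f"writes unexplained by deferred or delayed DFW: {unexplained:.0f}/sec")
```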

147 Posts

June 10th, 2010 15:00

Yeah, I would expect your logic to be correct.

However, experience has taught me these Symm metrics aren't so straightforward, and some numbers are counted somewhere else.

some questions:

1) are you seeing write misses against these devs?

2) on the FA/EF processor are you seeing "device write pending events" > 0

3) on the FA/EF processor are you seeing "system write pending events" > 0

Can you please email me the SR number? My email is in my profile info. Thanks.

1.3K Posts

June 11th, 2010 03:00

Sometimes counters are based on requests, which are accesses to cache slots. So if an IO spans a slot boundary, it can count as two requests.
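
For illustration - the slot size below is a stand-in; the real slot size depends on the box and emulation:

```python
# Illustration of request counting when counters are per cache-slot access.
# SLOT_SIZE is a stand-in value; the real slot size depends on the box
# and emulation.
SLOT_SIZE = 64 * 1024  # bytes (assumed for illustration)

def requests_for_io(offset: int, length: int, slot_size: int = SLOT_SIZE) -> int:
    """Number of cache slots an IO touches, i.e. how many requests get counted."""
    first_slot = offset // slot_size
    last_slot = (offset + length - 1) // slot_size
    return last_slot - first_slot + 1

# A 32 KB write starting 48 KB into a slot crosses the boundary -> 2 requests.
print(requests_for_io(offset=48 * 1024, length=32 * 1024))  # 2
# The same 32 KB write aligned to the start of a slot stays in one -> 1 request.
print(requests_for_io(offset=0, length=32 * 1024))          # 1
```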

Write misses can be attributed to reasons other than WP limits being reached - a slot being locked is one example.

If you are seeing WP counts very close to the limit, then there is a good chance the IOs are being delayed because of this. We can start issuing delays on the front end before the WP limit is reached.

The counter on the front end for WP events is a good place to look, as mentioned.

385 Posts

June 11th, 2010 05:00

No to questions 2 & 3 - there are no obvious signs of WP events at the adapter or system level.  The only oddness we see relating to write pending is some of the busy devices getting close to the limit and plateuing, but never hitting the write limit threshold.

Yes on write misses. An example of the discrepancy: during the period in question (about 30 minutes from 9 PM on), at the system metric level we have about 4,000 writes per sec with fewer than 1,000 write hits per second.
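
Spelling out that arithmetic (taking 1,000 as an upper bound on the hit rate):

```python
# The system-level numbers from the 30-minute window reported above.
writes_per_sec = 4000
write_hits_per_sec = 1000  # "fewer than 1,000", so treat this as an upper bound

hit_pct = 100 * write_hits_per_sec / writes_per_sec
print(f"write hit ratio <= {hit_pct:.0f}%, i.e. >= {100 - hit_pct:.0f}% of writes miss cache")
```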

What is odd is that we don't necessarily see write misses, but we see the write pending count get to about 95% of the write pending threshold.

1.3K Posts

August 17th, 2010 05:00

It can take a lot of SATA drives to saturate a DA. Can you upload your BTP file to the FTP area so I could take a look at it? And the bin file as well, if you have it.

147 Posts

August 17th, 2010 05:00

Come on, man - stop teasing us performance guys and give us a BTP.

Quincy is EMC Symm Perf Engineering, btw.

385 Posts

August 17th, 2010 05:00

We are waiting on a final answer from EMC engineering, but it looks like our DA paths were hitting their saturation point. For some reason WLA was unable to show that with the data it can parse, though with Symmerge the performance guru was able to show these ports being saturated.

Will update once we receive an answer as to why this is the case.

385 Posts

August 17th, 2010 05:00

I don't want to step on any toes - we have an analysis that was done already, and just to be clear, we had an architecture issue. There are some other factors in play here that we were not aware of at the time.

1) There are busy flash drives on this loop.

2) These are all Raid-6 SATA devices.

3) A lot of these devices got isolated behind a single pair of DAs.

My understanding so far is that WLA doesn't show DA throughput limits - only CPU limits - and it was apparently the throughput limit we were pushing. Hoping for a better solution/explanation of how we could detect/monitor the DAs in the future.
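
In the meantime, one rough way to approximate DA-level throughput is to roll per-device MB/s up to the DA each device sits behind and compare against an assumed adapter ceiling. A sketch (the device-to-DA mapping and the ceiling below are placeholders, not real DMX-4 specs):

```python
# Rough DA roll-up: sum per-device MB/s by the DA each device sits behind,
# then compare against an assumed adapter throughput ceiling.
# The device-to-DA mapping and the 200 MB/s ceiling are placeholders,
# not real DMX-4 specs - substitute values from your configuration.
from collections import defaultdict

DA_CEILING_MB_S = 200.0  # assumed per-DA throughput limit (illustrative)

device_to_da = {"dev_001": "DA-7A", "dev_002": "DA-7A", "dev_003": "DA-8B"}
device_mb_s = {"dev_001": 95.0, "dev_002": 88.0, "dev_003": 40.0}

per_da = defaultdict(float)
for dev, da in device_to_da.items():
    per_da[da] += device_mb_s.get(dev, 0.0)

for da, mb_s in sorted(per_da.items()):
    util = 100.0 * mb_s / DA_CEILING_MB_S
    flag = "  <-- possibly saturated" if util >= 90.0 else ""
    print(f"{da}: {mb_s:.0f} MB/s ({util:.0f}% of assumed ceiling){flag}")
```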

1.3K Posts

August 17th, 2010 06:00

OK, that makes sense. It takes fewer drives to saturate a DA if the IOs are large; I was thinking of OLTP-like small-IO workloads.
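
To see why IO size matters, a back-of-envelope sketch (every number here is an assumption for illustration, not a DMX-4 spec):

```python
# Throughput per drive scales with IO size: MB/s = IOPS * IO size.
# Every number below is an assumption for illustration, not a DMX-4 spec.
DA_CEILING_MB_S = 200.0  # assumed DA throughput ceiling
PER_DRIVE_IOPS = 80.0    # assumed sustained IOPS for one SATA drive

for io_kb in (4, 64, 256):
    per_drive_mb_s = PER_DRIVE_IOPS * io_kb / 1024.0
    drives = DA_CEILING_MB_S / per_drive_mb_s
    print(f"{io_kb:>3} KB IOs: ~{drives:.0f} drives to saturate the DA")
```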
