
Unsolved



April 29th, 2010 05:00

Limited cross site bandwidth SRDF question

Hi

I am working at a site where the cross-site bandwidth is 300MB/s duplexed, and we often see that capacity reached, usually during the early hours.

The environment is a mixture of SRDF/A and SRDF acp_disk.

Specifically, what would happen to either cross-site sync (one being SRDF/A, the other SRDF acp_disk) when the link bandwidth capacity is reached? There is no QoS throttling set or priority given to either traffic source.

What I am expecting is that the acp_disk copy would slow down to allow the SRDF/A traffic through.

Both sources and targets are on different Symmetrixes (four arrays in total); all the arrays involved are on 5778 code.

Thanks

Neil

46 Posts

April 29th, 2010 06:00

Thanks ksmith!

That makes sense. We did see the SRDF/A session drop, and it was because cache utilization was going through the roof. I think for the time being we will throttle the acp_disk copies until we have the extra bandwidth in place.

I understand that Transmit Idle is enabled on a per-group basis and that the default is 60 seconds, though this site has it set to 30.

12 Posts

April 29th, 2010 06:00

This is how I understand it:

acp_disk replication will not stop, and it does give priority to SRDF/A. If you have Transmit Idle enabled and the transfer rate is inhibited enough to cause SRDF/A to fall behind, cache is used to store chunks of data not yet sent to the target array. The source array will not allow SRDF/A data to drive cache utilization above 80%; when that threshold is reached, the source array suspends SRDF/A replication.

If you do not have Transmit Idle enabled, I believe SRDF/A will be suspended much more quickly.
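To make the behaviour described above concrete, here is a minimal Python sketch of that cache check. The 80% figure comes from the post above; everything else (function name, slot counts) is purely illustrative and is not EMC code.

    # Toy model of the behaviour described above: the source array will not
    # let SRDF/A data push cache utilization past a threshold (80% here);
    # once it does, the SRDF/A session is suspended.

    SRDFA_CACHE_LIMIT = 0.80  # 80% threshold from the post above; exact meaning assumed

    def srdfa_state(srdfa_cache_slots, total_cache_slots):
        """Return the notional SRDF/A session state for a given cache load."""
        utilization = srdfa_cache_slots / total_cache_slots
        return "suspended" if utilization >= SRDFA_CACHE_LIMIT else "active"

    # Example: 850,000 of 1,000,000 cache slots held by unsent SRDF/A data.
    print(srdfa_state(850_000, 1_000_000))  # -> suspended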

2.1K Posts

April 30th, 2010 11:00

I'd just like to clarify a few things as I think the discussion may not be going in entirely the right direction.

First off, can you please confirm the Enginuity level? You listed it as 5778, but I was under the impression that 5774 was still the latest released version (we are still running 5773 on our DMX3 arrays).

Next I'd like to discuss Transmit Idle a bit. Unless I'm missing the mark entirely, I don't think Transmit Idle will have any impact on the situation you are describing.

Without Transmit Idle there is a setting called "Link Limbo" which defines how long a link outage SRDF/A can sustain before dropping out of async mode and partitioning the pairs. With Transmit Idle enabled, TI kicks in once Link Limbo is exceeded and uses cache to continue tracking changes at the source end, so that consistency and proper write order can be maintained until the link comes back online. Once the maximum cache is consumed in TI mode, groups start dropping based on priority and usage (the busiest group drops first, freeing up resources for the other groups).

If you also have Delta Set Extension (DSE) configured, then instead of dropping when the TI cache gets full, data starts getting migrated out of cache to the DSE pool. When the link comes back up, data is moved back into cache as space frees up, and cache feeds directly to SRDF/A to push the changed tracks across the link. Everything should stay in async mode, which effectively means that your R2 devices will stay consistent instead of having to resynch first.
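As a rough illustration of that sequence, here is a toy Python sketch of the link-outage path (Link Limbo, then Transmit Idle buffering, then DSE spill or group drops). The function, its parameters, and the flow are an illustration of the description above, not Enginuity behaviour verbatim.

    # Toy decision flow for a link outage: Link Limbo absorbs a short outage,
    # Transmit Idle then buffers cycles in cache, DSE (if configured) spills
    # to a pool, and only then do groups start dropping out of async mode.
    # All names and parameters are illustrative.

    def outage_outcome(outage_secs, link_limbo_secs, transmit_idle, cache_full, dse_configured):
        if outage_secs <= link_limbo_secs:
            return "ride through the outage; stay in async mode"
        if not transmit_idle:
            return "drop out of async mode; pairs partitioned"
        if not cache_full:
            return "Transmit Idle: buffer changes in cache, stay in async mode"
        if dse_configured:
            return "cache full: spill to the DSE pool, stay in async mode"
        return "cache full: drop busiest group(s) to free resources"

    print(outage_outcome(90, 30, transmit_idle=True, cache_full=True, dse_configured=True))
    # -> cache full: spill to the DSE pool, stay in async mode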

None of that has anything to do directly with situations where the link is up but is just "full", i.e. where the hosts are generating more change on the source volumes than you can push across to the target array. In this case the changes will pile up in the Write Pending (WP) cache. At some point the WP cache threshold will be reached and the busiest RDF group will be dropped out of async mode to reallocate WP cache to the other groups. Sometimes this will also drop other groups, but it shouldn't drop them all.

We have run into this in the past and ended up going through a series of upgrades to resolve the issue, so that the mainframe group stopped dropping regularly due to high activity. As an amusing side note, the error indicating that a group was dropped because the WP cache threshold was exceeded included the code "CACA10" in the alert. This led to the obvious naming of these errors as "CACA 10" errors (pronounced with a hard "C") :-)
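Here is a similarly rough sketch of the "link up but full" case, showing the busiest-group-drops-first idea; the threshold value and group names are made up for the example and are not documented settings.

    # Toy sketch of the saturated-link case: when the write-pending cache
    # threshold is exceeded, the busiest RDF group is dropped out of async
    # mode first to free WP cache for the other groups.

    WP_THRESHOLD = 0.75  # illustrative system WP limit, not a documented value

    def group_to_drop(wp_slots_by_group, total_cache_slots):
        """Return the group that would drop first, or None if under threshold."""
        if sum(wp_slots_by_group.values()) / total_cache_slots < WP_THRESHOLD:
            return None
        return max(wp_slots_by_group, key=wp_slots_by_group.get)  # busiest group

    groups = {"rdfg_10_mainframe": 500_000, "rdfg_20_open": 200_000, "rdfg_30_open": 120_000}
    print(group_to_drop(groups, 1_000_000))  # -> rdfg_10_mainframe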

So, none of this really answers the root question. We are also experiencing something similar in one of our environments, which prompted us to ask EMC how we could ensure that SRDF/A traffic does not lose priority to acp_disk traffic. The only answer we could get was to apply QoS settings. We are still waiting for the details on how to make the correct changes; EMC is waiting for more info from our mainframe storage group in order to determine the correct settings. I would strongly recommend you engage your local EMC Account Team or CE for advice.
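For a sense of what a QoS-style priority would mean in practice, here is a toy sketch of the 300MB/s link being shared with SRDF/A served first and acp_disk throttled to the remainder. The bandwidth figure comes from this thread; the logic is only an illustration of the idea, not an actual QoS implementation or EMC setting.

    # Toy priority split of the replication link: SRDF/A demand is satisfied
    # first and acp_disk copies get whatever bandwidth is left.

    LINK_MBPS = 300  # from the original post

    def share_link(srdfa_demand_mbps, acp_disk_demand_mbps):
        srdfa = min(srdfa_demand_mbps, LINK_MBPS)                 # SRDF/A served first
        acp_disk = min(acp_disk_demand_mbps, LINK_MBPS - srdfa)   # acp_disk throttled
        return {"srdf_a": srdfa, "acp_disk": acp_disk}

    print(share_link(srdfa_demand_mbps=220, acp_disk_demand_mbps=150))
    # -> {'srdf_a': 220, 'acp_disk': 80}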

