I was under the impression that resuming an acp_disk synchronisation was a low-priority SRDF task and that existing SRDF/S data streams were not impacted. I thought the adaptive copy process would only use the "free" bandwidth, so the production bandwidth would not be affected.
Every day we copy 300 to 500 GB of data using acp_disk and let it run for almost two hours. After this we switch to SRDF/A to get the data consistent, and after two or three minutes we do a suspend.
Everything worked just fine until yesterday, when an SRDF/S customer complained that during this two-hour window he experiences extreme slowness on his SAN-attached disks. Given this customer's history and previous incidents, I'm pretty sure it is related to our daily acp_disk sync.
What I would like to know is whether we can change the priority of this acp_disk sync, or at least have someone explain why EMC says adaptive copy is a low-priority thing when other synchronous links ARE still impacted.
Unfortunately I don't have a lot of experience with the adaptive copy modes. Looking at the documentation, it says that when the skew number is exceeded, the device switches to synchronous or semi-synchronous mode until the number of invalid tracks drops back below the skew value. Could that be the case in your situation?
Hey, if that is the case we have an issue here! I was always told that for an initial sync you should always use adaptive copy, but if you're using a somewhat larger device or meta, the number of tracks that need to be copied exceeds the maximum number of invalid tracks (the skew value?), and everything switches to sync or semi-sync? I can't believe that is actually what happens. It would mean that every initial pair create starts out running as (semi-)synchronous.
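To see why larger devices would trip this, note that the skew is expressed in tracks. A rough back-of-envelope check (the 32 KB track size is an assumption based on older Symmetrix models; it lines up with the "2 GB" figure quoted later in this thread, but newer boxes use larger tracks):

```python
# Rough check: does a device's track count exceed the max ACp skew?
# Assumes 32 KB Symmetrix tracks (an assumption; newer models differ).
TRACK_KB = 32
MAX_SKEW_TRACKS = 65_534   # 65,535 is treated as "infinite" skew

def tracks_for_gb(size_gb: float, track_kb: int = TRACK_KB) -> int:
    """Number of tracks needed to hold size_gb of data."""
    return (int(size_gb * 1024 * 1024) + track_kb - 1) // track_kb

def exceeds_skew(size_gb: float) -> bool:
    """True if a full copy of the device has more invalid tracks than the skew."""
    return tracks_for_gb(size_gb) > MAX_SKEW_TRACKS

# A 2 GB device is right at the limit; anything larger exceeds it.
print(tracks_for_gb(2))    # 65536 tracks -> just over 65,534
print(exceeds_skew(100))   # True: a 100 GB meta far exceeds the skew
```

So at 32 KB tracks, any device over roughly 2 GB would exceed the maximum finite skew during an initial sync, which is why the "infinite" 65,535 value exists.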
But you triggered me. I will start digging deeper now.
An interesting question! Adaptive copy mode does have a lower priority than synchronous SRDF, but there are other factors to consider too. I'm not a Symmetrix performance guru (and I do suggest you open an SR with your local support to get a performance specialist engaged).
Here are a few things to think about in the meantime:
The max ACp skew (65535) is interpreted as infinity. In other words, if it's set at this (default) value, the RDF mode stays in adaptive copy.
Are your SRDF links heavily utilised with SRDF/S traffic, such that the additional ACp traffic now completely saturates the link? That would cause everything to slow down.
The additional backend (DA + disk) activity imposed by the sync could be causing increased service times for some hosts - especially if the volumes they're accessing reside on the same spindles as the volumes being synched (remember there's only one set of drive heads!).
RAID mode can have an impact on the above: with mirrored volumes there are obviously two copies of the data, either of which can service a read. With RAID 5 there is only one copy, so read misses must come from that one device, potentially causing hot spindles.
Cache Pressure - If the cache is very full of data waiting to be destaged (dirty), RDF writes to volumes in ACp_disk mode can be discarded once the local write(s) have completed. The remote track gets marked as invalid. This allows a cache slot to be reused quickly without having to wait for RDF. However, the track will eventually need to be copied over the RDF link, and the penalty for the quick win is having to do another local read from disk into cache in order to then send the track to the remote box. This results in higher backend I/O activity.
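The cache-pressure trade described above can be made concrete with a toy backend I/O count (a sketch of the mechanism as described, not actual Enginuity behaviour):

```python
# Toy model: backend (DA + disk) I/O cost per track written under ACp_disk.
# Normal path:   local destage, then the track is sent to R2 from cache.
# Pressure path: the cache slot is reclaimed early, the remote track is
#                marked invalid, and the track must later be re-read from
#                local disk into cache before it can be sent over RDF.
def backend_ios_per_track(cache_pressure: bool) -> int:
    ios = 1            # local destage: write the track to local disk
    if cache_pressure:
        ios += 1       # penalty: extra local read to reload the track
    return ios         # the RDF send itself is link traffic, not disk I/O

print(backend_ios_per_track(False))  # 1 backend I/O per track
print(backend_ios_per_track(True))   # 2 backend I/Os per track
```

Under sustained cache pressure the backend workload for the same sync roughly doubles, which is one way an ACp sync can hurt hosts sharing those spindles.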
RDF group utilisation. Are there multiple RDF groups per RA? How is the workload balanced between them?
I believe the RA will "round-robin" if there are multiple RDF groups on it. Therefore it may be possible that a low priority ACp sync in an otherwise idle group could be disproportionately taking cycles from busy SRDF/S sessions in other groups on the same RA.
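The round-robin concern can be illustrated with a toy scheduler: if the RA services its RDF groups in strict rotation, an otherwise idle group running an ACp sync gets the same share of cycles as a busy SRDF/S group (this is an assumption about RA scheduling, for illustration only):

```python
from itertools import cycle

# Toy strict round-robin over RDF groups on one RA: each group with
# pending work gets one cycle per rotation, regardless of RDF mode.
def serve(groups: dict[str, int], rotations: int) -> dict[str, int]:
    """groups maps group name -> pending tracks; returns tracks served."""
    served = {g: 0 for g in groups}
    pending = dict(groups)
    order = cycle(groups)
    for g in (next(order) for _ in range(rotations * len(groups))):
        if pending[g] > 0:
            pending[g] -= 1
            served[g] += 1
    return served

# A busy SRDF/S group and an ACp sync group split the RA 50/50,
# even though the ACp work is notionally "low priority".
print(serve({"srdf_s": 1000, "acp_sync": 1000}, 10))
```

Under this model the "low priority" ACp group still consumes half the RA's cycles whenever it has work pending, which would match the symptoms described.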
The answer? Lots of analysis to determine where the perceived performance problem lies. For a quick fix, I would start by investigating the symqos (Quality of Service) Solutions Enabler commands, with which you can set the SRDF copy pace. For example:

# symqos -g drdf set RDF pace 10

...sets the RDF copy pace of all the devices in group "drdf" to 10, which is the slowest. You'll need to tune the pace setting, but this may allow you to slow down the ACp sync such that it does not impact the SRDF/S.
I quote: "The skew value is configured at the device level and may be set to a value between 0 and 65,534 tracks. For devices larger than a 2 GB capacity drive, a value of 65,535 can be specified to target all the tracks of any given drive." So with the skew set to the maximum value, it should never switch to SRDF/S or /A. But still, when my 200 LUNs (single symdevs as well as metas) start "adaptive copying", our SRDF/S is impacted, and we received incidents from customers saying that their performance was severely impacted.
I regret to say that my question still isn't answered. I should also add that we created a workaround for this problem: we started running these ACP_Disk syncs more than once a day instead of once a day, and (for now) nobody has complained.
Although the question hasn't really been answered, I'd like to close this topic, since it no longer applies to a live problem we were facing.