Unsolved
What's the impact on existing SRDF/S when a new adaptive copy starts running?
I was under the assumption that resuming an acp_disk synchronisation was a low-priority SRDF "task" and that existing SRDF/S data streams were not impacted. I thought the adaptive copy process would only use the "free" bandwidth, so the "production" bandwidth would not be affected.
Every day we copy 300 to 500 GB of data using acp_disk (roughly the sequence sketched below) and let it run for almost 2 hours. After this we switch to SRDF/A to get the data consistent, and after 2 or 3 minutes we do a suspend.
Everything worked just fine, but yesterday an SRDF/S customer complained that during this 2-hour period he experiences extreme slowness on his SAN-attached disks. Considering this customer's history and previous incidents, I'm pretty sure it has to do with our acp_disk sync, which runs every day.
What I would like to know is whether we can change the priority of this acp_disk sync, or at least have someone explain why EMC calls adaptive copy a low-priority thing while other synchronous links ARE still impacted.
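For reference, our daily cycle looks roughly like this. This is a simplified sketch, not our actual scripts: the device group name "daily_dg" is a placeholder, and whether you use establish or resume depends on how the pairs were split or suspended.
# symrdf -g daily_dg set mode acp_disk -noprompt
# symrdf -g daily_dg establish -noprompt
(let the adaptive copy run for roughly 2 hours)
# symrdf -g daily_dg set mode async -noprompt
(wait 2 or 3 minutes for the data to become consistent)
# symrdf -g daily_dg suspend -noprompt
...set the group to adaptive copy disk mode and resume copying the invalid tracks, then switch to SRDF/A to reach a consistent point, and finally suspend the links.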
RRR
January 4th, 2008 08:00
I was always told that for an initial sync you should always use adaptive copy, but if you're using a somewhat larger device or meta, the number of tracks that need to be copied exceeds the maximum number of invalid tracks (the skew value?) and thus everything switches to async, sync or semi-sync? I can't believe that is actually what happens. That would mean that every initial pair create ends up running as (a/semi-)sync.
But you've got me thinking; I'll start digging deeper now.
Anyone else have any ideas?
RRR
January 4th, 2008 08:00
Thank you so far
RRR
January 4th, 2008 08:00
Do you know what the impact is on existing SRDF/S when adaptive copy is started or running? It should be low impact, but I have my doubts.
MarcT2
January 4th, 2008 09:00
Here are a few things to think about in the meantime:
The max ACp skew value (65535) is interpreted as infinity. In other words, if it's set at this (default) value, the RDF mode stays in adaptive copy.
Are your SRDF links heavily utilised with SRDF/S traffic, such that the additional ACp traffic now completely saturates the link? This will cause everything to slow down.
The additional backend (DA + disk) activity imposed by the sync could be causing increased service times for some hosts - especially if the volumes they're accessing reside on the same spindles as the volumes being synched (remember there's only one set of drive heads!).
RAID mode can have an impact on the above - with mirrored volumes there are obviously 2 copies of the data - either of which can service a read. With RAID5 there is only one copy, so read misses have to come from that one device - potentially causing hot spindles.
Cache Pressure - If the cache is very full of data waiting to be destaged (dirty), RDF writes to volumes in ACp_disk mode can be discarded once the local write(s) have completed. The remote track gets marked as invalid. This allows a cache slot to be reused quickly without having to wait for RDF. However, the track will eventually need to be copied over the RDF link, and the penalty for the quick win is having to do another local read from disk into cache in order to then send the track to the remote box. This results in higher backend I/O activity.
RDF group utilisation. Are there multiple RDF groups per RA? How is the workload balanced between them?
I believe the RA will "round-robin" if there are multiple RDF groups on it. Therefore it may be possible that a low priority ACp sync in an otherwise idle group could be disproportionately taking cycles from busy SRDF/S sessions in other groups on the same RA.
The answer? Lots of analysis to determine where the perceived performance problem lies. For a quick fix, I would start by investigating the symqos (Quality of Service) Solutions Enabler commands, with which you can set the SRDF copy pace. For example:
# symqos -g drdf set RDF pace 10
...sets the RDF copy pace of all the devices in group "drdf" to 10, which is the slowest. You'll need to tune the pace setting, but this may allow you to slow down the ACp sync such that it does not impact the SRDF/S.
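To check whether a pace change is having the desired effect, you can also watch the copy while it runs - for example (again using the example group "drdf"):
# symrdf -g drdf query
...shows the RDF mode and the remaining invalid track counts for the devices in the group, so you can confirm the group is still in adaptive copy disk mode and see how quickly the invalids drain at a given pace setting.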
Hope that's of some help!
Best Regards, Marc
RRR
February 18th, 2008 02:00
So with the skew set to the maximum value it should never switch to SRDF/S or /A, but still, when my 200 LUNs (single symdevs as well as metas) start "adaptive copying", our SRDF/S is impacted and we received incidents from customers saying that their performance was severely impacted.
I regret to say that my question still isn't answered. I must also add that we created a workaround for this problem: we now run these ACP_Disk syncs more than once a day instead of once a day (a rough sketch of the schedule is below), and (for now) nobody has complained.
Although the question hasn't really been answered, I'd like to close this question/topic, since it no longer applies to a live problem we were facing.
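For completeness, the workaround simply kicks off the same acp_disk cycle several times a day instead of once, so each run has far fewer invalid tracks to move. The schedule looks roughly like this (the script name, times and device group are placeholders, not our real configuration):
# crontab on the management host
0 6,12,18,23 * * * /usr/local/bin/srdf_acp_cycle.sh daily_dg
where srdf_acp_cycle.sh runs the same set mode acp_disk / establish / set mode async / suspend sequence shown in the original post, just over a much smaller delta each time.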