February 20th, 2015 06:00

Open Replicator Hot Push query

Hi,

We had to migrate the storage of a server on Friday (253 LUNs, 20 TB). We decided to create a hot push Open Replicator (OR) session, which we launched with -precopy three days before (on Wednesday), expecting that by Friday most of the data would have been pre-copied and we would only have to copy the final delta, minimizing the time the server had to be powered off.
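For reference, the session was created with symrcopy along roughly these lines (the device-pair file name is a placeholder, and the exact flags should be checked against your Solutions Enabler documentation):

  # devs.txt pairs each control device with its remote target
  symrcopy create -file devs.txt -name sarto -hot -push -precopy -copy -pace 5
  symrcopy query -file devs.txt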

On Wednesday the pre-copy started at an acceptable rate (250 MB/s with pace 5 and a ceiling of 40%), but by Thursday morning the throughput had decreased to around 1 MB/s. Many of the LUNs were 9x% or 100% pre-copied (8 TB of the 20 TB had been pre-copied), but other LUNs were only at 2-3%.

My question is: why did the pre-copy not continue copying the data at the same rate? I can imagine that as more LUNs finish pre-copying, the job has to keep track of the changes made to those LUNs in order to execute the final copy, but I'm only guessing.

Do you have any idea what could be the reason for this?

On Friday morning I reconfigured the session and executed a hot pull so we could start the server before the copy had finished.

Thank you

---------------------------------

Sample output of the hot push session:

Session Name          : sarto

           Control Device                         Remote Device             Flags      Status     Done Pace      Name
-------------------------------------- ----------------------------------- ------- -------------- ---- ---- -----------------
                   Protected Modified
SID:symdev         Tracks    Tracks    Identification                   RI CDSHUTZ  CTL <=> REM    (%)
------------------ --------- --------- -------------------------------- -- ------- -------------- ---- ---- -----------------
0XXXXXXXX:3A84    7967660         0 0YYYYYYYYYY:17B5                SD XXXX.S. Precopy           2    1 sarto
0XXXXXXXX:3A80    7926551         0 0YYYYYYYYYY:17B1                SD XXXX.S. Precopy           3    1 sarto
0XXXXXXXX:3A7C    7689385         0 0YYYYYYYYYY:17AD                SD XXXX.S. Precopy           6    1 sarto
0XXXXXXXX:3A78    7433552         0 0YYYYYYYYYY:17A9                SD XXXX.S. Precopy           9    1 sarto
.... [output was cut]
0XXXXXXXX:299C       7684         0 0YYYYYYYYYY:16E9                SD XXXX.S. Precopy          98    1 sarto
0XXXXXXXX:299B          0         0 0YYYYYYYYYY:16E8                SD XXXX.S. Precopy         100    1 sarto
0XXXXXXXX:299A          0         0 0YYYYYYYYYY:16E7                SD XXXX.S. Precopy         100    1 sarto
0XXXXXXXX:2999       3708         0 0YYYYYYYYYY:16E6                SD XXXX.S. Precopy          99    1 sarto

Total              ---------
  Track(s)          198958599
  MB(s)             12434918

Copy rate                      :      1.4 MB/S
Estimated time to completion   : 95 days, 16:24:55

465 Posts

February 22nd, 2015 16:00

I wonder if the devices with the low copy rate are more heavily utilised by the host. Have you had a look at LUN stats from the host I/O perspective? Are these very busy LUNs?

Also, as a general rule, copy operations are throttled in a busy system. Take a look at the Unisphere device and system write-pending stats: was your array under write-pending stress during the slow copy?

Finally, check the FA % busy. Is there a correlation between the slowdown and high FA CPU utilisation?

34 Posts

February 23rd, 2015 06:00

I've seen this behaviour with pre-copy. I'll try to dig out the answer with more details, but the bottom line is that it's something to do with tasks that run on the FA processor.

We did some testing and found that if you keep the number of devices below about 10, all is good, but if you go above that it gets problematic: as devices get close to fully copied, all the other devices stall and the copy rate drops to pretty much nothing. Setting individual devices to nocopy when they were close to being copied helped a little, but sometimes it would take half an hour or so for the copy rate to pick up (not much good in tight change windows). In the end we just gave up and switched to using hot push differential instead, which worked much more predictably.
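If it helps anyone else, the differential approach looks roughly like this (the file and session names are placeholders, and this is from memory, so verify the flags against your Solutions Enabler release):

  symrcopy create -file devs.txt -name sarto -hot -push -differential -copy
  symrcopy activate -file devs.txt -name sarto
  # once the bulk copy completes, recreate/activate sweeps only the changed tracks
  symrcopy recreate -file devs.txt -name sarto
  symrcopy activate -file devs.txt -name sarto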

20.4K Posts

February 23rd, 2015 08:00

I am currently migrating a DMX4 (~300 TB) to a VNX5600 and see the same thing. All my sessions are hot -precopy; they get pretty close to 100% pre-copied, but when I activate them they will sit there for 30 minutes to an hour just to copy the last 30 megabytes. It has to be some kind of scheduling, because they will sit there and then all of a sudden they all take off and finish immediately. When I create my sessions I start with pace 9 and then, an hour before cut-over, I change the pace to 0. Ceiling is set to None on all FAs.
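The pace change is just an update on the live session; pace 0 is the fastest setting and 9 the most throttled. From memory it is something like this (the file name is a placeholder and the exact argument order may differ, so check the symrcopy help):

  symrcopy set pace 0 -file devs.txt -name sarto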

34 Posts

February 23rd, 2015 09:00

Sounds very similar to how ours behaved. We did, however, have a ceiling set, as the FAs were shared with other hosts and impacting them would have been a bad day at the office. It's a while since we did this, but I think the switch from precopy to copy would sit there for a random amount of time before the session picked up and hit the ceiling limit; this could be anywhere from 5 minutes to an hour depending on the size of the session. That's not great when you only have a 2-hour change window to get the migration through and the copy rate is 1.1 MB/s, claiming 3 days to complete.

This is taken from EMC knowledgebase article emc286167:

When the ceiling is set to any value between 0 and 100, there is a default task 15 queue limit of 10, which limits the number of copy tasks that ORS can create. ORS can create one task 15 for each device concurrently, so if you are attempting to copy 300 logical devices in the array and the queue limit is at its default of 10, throughput will be limited by the imposed task 15 queue limit. This is by design, imposed to prevent impact to other applications sharing the FA ports.

If the ceiling is not set, then ORS can consume the FA CPU scanning for tracks to copy. This can impact other applications running on the same FAs, and for this reason it is recommended to always set a ceiling value between 0 and 100 (unless using dedicated FA CPUs for the ORS migration).
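For anyone who needs to set it, the ceiling is configured per FA director rather than per session. From memory it is something along these lines (the SID and director are placeholders and I may have the exact syntax wrong, so double-check against the symrcopy help):

  symrcopy set ceiling 40 -sid 1234 -dir 7E
  # NONE removes the throttle entirely; not recommended on shared FAs
  symrcopy set ceiling NONE -sid 1234 -dir ALL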

20.4K Posts

February 23rd, 2015 09:00

I forgot to add that I do have a set of dedicated FAs that I use for OR; the rest of the FAs are zoned to VNX ports but have the ceiling set to 0. For the last couple of migrations I tried raising all FAs to None thinking it would help, but did not see much improvement. It is so frustrating to watch it go at 2 MB/s with only 30 megabytes left.
