sluetze — 2 Intern, 300 Posts — July 17th, 2015 05:00
hi dynamox,
3a) after doing the failover (allow-write), you have to do a Prepare Resync to set up the mirror policy. You may want to configure a schedule (AFAIK it is set to manual after creation).
3b) it is an incremental copy. If all of the data has changed, this won't help you. I did some testing recently, and as long as you don't break the relation between the normal policy and the mirror policy, you won't have to perform a full sync.
Something more to keep in mind:
The application will not be able to change the 40 TB instantly; that would also take some time. So if the upgrade fails, it may not be necessary to do a complete restore of all the files, only of parts of the data.
Regards
--sluetze
Peter_Sero — 4 Operator, 1.2K Posts — July 17th, 2015 06:00
Check out "overlayfs" for Linux. It's simple: you can always revert to the underlying read-only data by just unmounting and wiping the scratch overlay, which contains only the diffs written since it was mounted. No per-file recoveries, which makes reverting fast. As you just estimated about 2 TB worth of diffs in your case, you would need at least that amount of local scratch.
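A minimal sketch of that mount-and-revert flow, assuming a local scratch disk at /scratch and the read-only data at /data-ro (all paths are hypothetical; needs a kernel with overlayfs, 3.18+, and root):

```shell
# Prepare the overlay directories on the local scratch disk
mkdir -p /scratch/upper /scratch/work /mnt/app

# Mount the overlay: reads fall through to /data-ro,
# all writes land in /scratch/upper
mount -t overlay overlay \
  -o lowerdir=/data-ro,upperdir=/scratch/upper,workdir=/scratch/work \
  /mnt/app

# ... point the application at /mnt/app and run the upgrade ...

# Revert: unmount and wipe the diffs; /data-ro is untouched
umount /mnt/app
rm -rf /scratch/upper /scratch/work
```

The revert is just a directory wipe, which is why it is fast regardless of how many files were changed.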
Stdekart — 104 Posts — July 17th, 2015 06:00
dynamox,
I just want to make sure the procedure for the fail over/fail back is documented in this thread for future reference.
I agree with sluetze that using SyncIQ fail over/fail back would be the best viable option/fastest recovery time.
If there is a failure, simply make the target (pre-upgrade data) read/write and point the app at the DR cluster's IPs.
7.1.0.x (Page 220-221)
https://support.emc.com/docu50220_OneFS-7.1.0-Web-Administration-Guide.pdf?language=en_US
7.1.1.x (Page 238-240)
https://support.emc.com/docu54201_OneFS-7.1.1-Web-Administration-Guide.pdf?language=en_US
7.2.0.x (Page 258-260)
https://support.emc.com/docu56049_OneFS-7.2-Web-Administration-Guide.pdf?language=en_US
Peter_Sero — 4 Operator, 1.2K Posts — July 17th, 2015 06:00
A bit out of left field: why not do a test run of the new app version first,
keeping all changes local to the app server?
WHAT?
Here is what I have in mind:
- take a snapshot of App3/
- mount the snapshot read-only on a test server
- overlay that read-only NFS mount with a local, writable scratch filesystem
- launch the new app version against the overlay
- if the overlay (the diffs to the r/o mount) exceeds the local scratch size... you have to cancel the test...
- otherwise, test the new app to satisfaction
- if the test ends fine, launch the new app in production
- (you will still want to have some "backup" to revert to,
but as an app failure has become less likely,
your regular backup/restore SLA might suffice.
Plus, from the overlay test you might have learned something
about how the new app acts on the data,
and might be able to refine the specific backup strategy on the Isilon.
E.g. imagine the app just rebuilding some index files but
leaving 99.99% of stuff untouched.)
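The steps above can be sketched as shell, under assumed names: the snapshot exported at nfs-server:/ifs/.snapshot/pre-upgrade/App3 and local scratch at /scratch (both hypothetical; the overlay's upper layer must be a local filesystem, only the lower layer is NFS):

```shell
# Mount the Isilon snapshot read-only over NFS
mkdir -p /mnt/app3-ro /scratch/upper /scratch/work /mnt/app3-test
mount -t nfs -o ro nfs-server:/ifs/.snapshot/pre-upgrade/App3 /mnt/app3-ro

# Overlay it with local writable scratch; the app sees a writable tree
mount -t overlay overlay \
  -o lowerdir=/mnt/app3-ro,upperdir=/scratch/upper,workdir=/scratch/work \
  /mnt/app3-test

# Run the new app version against /mnt/app3-test, and periodically
# check how close the diffs are to filling the local scratch:
du -sh /scratch/upper
```

Inspecting /scratch/upper afterwards also shows exactly which files the upgrade touched, which is the information that could refine the backup strategy.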
Just fwiw. Curious to see how the project will evolve, good luck!
-- Peter
dynamox — 9 Legend, 20.4K Posts — July 17th, 2015 06:00
Thanks sluetze. Yes, I don't think they will change 40 TB worth of data; let's say they only change 2 TB. Option 2 would be great if it only reverted the changed files instead of restoring the entire 40 TB.
So for 3a, I will need to complete these 3 steps (part of the fail back procedure) to get my replication going from secondary to primary:
On the primary cluster, click Data Protection > SyncIQ > Policies.
In the SyncIQ Policies table, in the row for a replication policy, from the Actions column, select Resync-prep. SyncIQ creates a mirror policy for each replication policy on the secondary cluster. SyncIQ names mirror policies according to the following pattern: _mirror
On the secondary cluster, replicate data to the primary cluster by using the mirror policies. You can replicate data either by manually starting the mirror policies or by modifying the mirror policies and specifying a schedule.
dynamox — 9 Legend, 20.4K Posts — July 17th, 2015 06:00
Hi Peter,
I am not sure what you mean by "overlay". Neither I nor the app owner knows whether the upgrade process changes only metadata or actual data files, so I have to be able to recover everything; there is no going in and restoring individual files.
dynamox — 9 Legend, 20.4K Posts — July 17th, 2015 08:00
Shane, can you please confirm that fail back is incremental?
Stdekart — 104 Posts — July 17th, 2015 09:00
dynamox,
I can confirm that fail back is incremental, after speaking to a few technical support engineers on the backup team (they cover SyncIQ).
They also made the recommendation to open an SR after the initial sync from source to target, to validate everything is good to go. They can also be available to help with the fail back if needed.
dynamox — 9 Legend, 20.4K Posts — July 17th, 2015 11:00
Thank you Shane. Any particular reason they are recommending to validate the initial sync? We set up SyncIQ policies all the time, and the assumption is that if a policy completes without errors, then all data is on the secondary cluster.
Stdekart — 104 Posts — July 17th, 2015 11:00
dynamox,
Confusion over the steps when setting up SyncIQ for fail over/fail back. Cases in the past have come down to the procedure not being followed correctly, which results in a full sync. It's a better-safe-than-sorry sanity check.
Stdekart — 104 Posts — July 17th, 2015 12:00
dynamox,
Correct, every step is outlined in the links I mentioned earlier.
It's issues with those steps not being followed correctly that result in a full resync.
dynamox — 9 Legend, 20.4K Posts — July 17th, 2015 12:00
This is how the fail back procedure is documented in the online help. I think it's pretty straightforward; is anything missing?
Fail back data to a primary cluster
After you fail over to a secondary cluster, you can fail back to the primary cluster.
Before you begin
Fail over a replication policy.
Procedure
On the primary cluster, click Data Protection > SyncIQ > Policies.
In the SyncIQ Policies table, in the row for a replication policy, from the Actions column, select Resync-prep. SyncIQ creates a mirror policy for each replication policy on the secondary cluster. SyncIQ names mirror policies according to the following pattern: _mirror
On the secondary cluster, replicate data to the primary cluster by using the mirror policies. You can replicate data either by manually starting the mirror policies or by modifying the mirror policies and specifying a schedule.
Prevent clients from accessing the secondary cluster and then run each mirror policy again. To minimize impact to clients, it is recommended that you wait until client access is low before preventing client access to the cluster.
On the primary cluster, click Data Protection > SyncIQ > Local Targets.
In the SyncIQ Local Targets table, from the Actions column, select Allow Writes for each mirror policy.
On the secondary cluster, click Data Protection > SyncIQ > Policies.
In the SyncIQ Policies table, from the Actions column, select Resync-prep for each mirror policy.
After you finish
Redirect clients to begin accessing the primary cluster.
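For reference, the same failback flow can also be driven from the CLI. This is a sketch only: the policy name "mypolicy" is a placeholder, the mirror-policy name assumes the _mirror suffix mentioned above, and the exact command names should be verified against the CLI guide for your OneFS version:

```shell
# On the primary (original source) cluster: prepare resync,
# which creates the mirror policy on the secondary
isi sync recovery resync-prep mypolicy

# On the secondary cluster: push changes back via the mirror policy
isi sync jobs start mypolicy_mirror

# Quiesce clients, run the mirror policy once more, then
# on the primary cluster re-enable writes for the mirror target:
isi sync recovery allow-write mypolicy_mirror

# Finally, on the secondary cluster, turn replication back around:
isi sync recovery resync-prep mypolicy_mirror
```

Then redirect clients back to the primary cluster, exactly as in the GUI procedure.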