May 6th, 2016 12:00

Ask the Expert: RecoverPoint for Virtual Machine 4.3 SP1

Welcome to this Ask the Expert discussion. On this occasion we're giving you the opportunity to debate with our expert about RecoverPoint for Virtual Machine 4.3 SP1.

EMC RecoverPoint for Virtual Machines 4.3 is a hypervisor-based, software-only data protection solution for protecting VMware® virtual machines. RecoverPoint for VMs enables both local and remote replication, allowing recovery to any point in time. RecoverPoint for VMs 4.3 SP1 adds a few features, including multi-cluster support, VC licensing, deployment automation and orchestration.

Meet Your Experts:

Mohit Sagar

Technical Support Engineer II

I have been with EMC for the last 4.5 years and have 9 years of overall work experience. I worked on multiple technologies before joining EMC and specialize in EMC storage and virtualization technologies, in addition to the RecoverPoint solution.

This discussion will take place May 16th - 27th. Get ready by bookmarking this page or signing up for e-mail notifications.

13 Posts

June 9th, 2016 03:00

To understand RPO in RecoverPoint, you need to clearly understand what lag means and how it impacts the RPO.

Data lag on the link can be defined as the difference in data between the source and target sites.

RPO controls the difference in data between production and DR. The book definition says RPO is the amount of data the customer can afford to lose in case of a disaster at production. In RP replication, we control RPO by defining the lag and the application regulation setting.

If the customer wants a specific RPO to be enforced, application regulation can be enabled. When it is enabled, RP slows host applications as they approach the lag policy limit, to guarantee the lag setting.

If application regulation is not desired, the RecoverPoint system tries to keep the difference between sites to a minimum by using as much bandwidth and as many resources as needed.
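The relationship between the lag limit and application regulation can be pictured with a small toy model. This is only an illustrative sketch, not RecoverPoint code or its API: the names `LagPolicy` and `next_action`, the MB unit, and the 80% `warn_ratio` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LagPolicy:
    max_lag_mb: float            # hypothetical lag limit, in MB of unreplicated data
    regulate_application: bool   # True = slow the host to stay within the limit


def next_action(current_lag_mb: float, policy: LagPolicy,
                warn_ratio: float = 0.8) -> str:
    """Return what this toy model does for a given lag value.

    warn_ratio is an assumed threshold for "approaching the limit";
    the real product's threshold is not stated in this thread.
    """
    if current_lag_mb < warn_ratio * policy.max_lag_mb:
        return "replicate"
    if policy.regulate_application:
        # Approaching the limit: throttle host writes so lag stays within policy.
        return "throttle_host"
    if current_lag_mb > policy.max_lag_mb:
        # No regulation: lag may exceed the limit and is only reported.
        return "raise_lag_event"
    return "replicate"
```

With regulation enabled, a lag of 90 MB against a 100 MB limit yields `throttle_host`; with regulation disabled, the same lag keeps replicating, and only a lag above 100 MB produces `raise_lag_event`.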

While talking about RPO, I would like to bring up the concept of sync and async replication.

In sync replication, the production host application initiates a write and then waits for an ACK from the remote RPA before initiating the next write, ensuring that the RPO value is always zero. In other words, application regulation is effectively enabled in sync replication.

In async replication, the host application initiates a write and does not wait for an ACK from the remote RPA before initiating the next write. The data of each write is stored in the local RPA and acknowledged at the local site. Based on the lag (RPO) policy and on system load and available resources, the RPA decides when to transfer the writes it holds to the replica storage.

I would like to clarify an important point. We are not taking snapshots of the production VM and replicating the difference to DR. We are intercepting every write I/O for the production VM and sending the writes, either sync or async, to DR.

If the RPO value is set too high, it will not meet the customer's RPO agreement. If the RPO value is set too low, RP will keep generating event messages indicating that the current lag value exceeds the maximum permitted lag value.
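The difference between the two modes can be sketched with a toy write path. This is a simplified model under stated assumptions, not how the RPA is implemented: `ToyReplicator`, its `pending` buffer and its `transfer` method are hypothetical names, and lag is counted in writes rather than in data or time.

```python
from collections import deque


class ToyReplicator:
    """Toy model of a replicated write path; not RecoverPoint code."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.pending = deque()  # writes acknowledged locally, not yet at the replica
        self.replica = []       # writes that have reached the remote side

    def write(self, data: str) -> None:
        if self.synchronous:
            # Sync: the write completes only once the remote side has it.
            self.replica.append(data)
        else:
            # Async: acknowledge locally and transfer later.
            self.pending.append(data)

    def transfer(self, max_writes: int) -> None:
        """Move up to max_writes buffered writes to the replica."""
        for _ in range(min(max_writes, len(self.pending))):
            self.replica.append(self.pending.popleft())

    @property
    def lag(self) -> int:
        """Number of writes the replica is behind."""
        return len(self.pending)
```

In this model the synchronous replicator always reports a lag of zero, while the asynchronous one accumulates lag until `transfer` is called, which is the quantity the lag policy is meant to bound.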

Snapshot granularity can be of 3 types

1. Per write

2. Per second

3. Dynamic

Per write indirectly means sync replication and would ensure zero RPO. Per second and dynamic are for async replication. Per second provides near-zero RPO. Dynamic means the system decides when and how much data needs to be replicated from source to target, and it tries to keep the lag under the permitted value.

Finally, distribution for RP4VMs remains the same as in classic RecoverPoint. Please refer to 5-phase, 3-phase and 1-phase distribution in the RecoverPoint admin guide, as its explanation is beyond the scope of this ECN discussion. If you have any specific question regarding any phase with respect to RP4VM, please let me know.

Once again, I would emphasize that in RP4VM we are not taking "vmware" snapshots of the prod VM and replicating the difference. The ESXi splitter on each host intercepts the prod VM write I/Os and sends them to the vRPA, which in turn replicates them to the DR vRPAs.

32 Posts

June 10th, 2016 06:00

Thanks Suden. As always, super helpful.

I did have another question about how RP4VM handles failure scenarios:

Let's say production site is down and DR becomes active (failover scenario).

What happens when production site is back online now and DR is still active? Do we initiate a failback? Will the production RP4VM assume identity as active cluster and try to replicate to DR? Or will an initial sync be done to determine that data on DR is newer than data on prod?

13 Posts

June 14th, 2016 01:00

Take an example with SiteA as production, SiteB as DR, and a consistency group CG1 protecting a VM.

When production is down, a user must start the failover wizard, which first enables image access so the selected image can be tested and then completes the failover procedure.

Once failover completes successfully, SiteA becomes DR with no access to the original VMs, and the VM at SiteB runs production.

When you say production is back online, from the RecoverPoint point of view the SiteA VM is still a shadow VM with the extension .recoverpoint and has the role of a remote VM copy. Unless a user initiates another failover to make the SiteA VM production again, the production application at SiteA will not be live.

Here are the logical steps as per your question.

1. The production site goes down.

2. Failover is initiated, which reverses the copy roles: SiteB becomes production and starts the application.

3. The application continues at SiteB until it is decided to make SiteA production again.

4. Failover is initiated again to reverse the roles.

5. Once failover completes, SiteA is production again.

When failover happens, the journal for the original production VM/copy is erased. No full init/sweep is required when failover is performed the second time, because a specific image is chosen to bring the application back at SiteA and replication only reverses direction.
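The role reversal in the steps above can be sketched as a tiny state model. It is illustrative only: `ConsistencyGroup` here is a hypothetical Python class, not a RecoverPoint object, and it tracks only which site holds the production role and the resulting replication direction.

```python
from dataclasses import dataclass


@dataclass
class ConsistencyGroup:
    name: str
    production: str  # site currently holding the production role
    replica: str     # site currently holding the DR (remote copy) role

    def failover(self) -> None:
        """User-initiated failover: swap the two copy roles."""
        self.production, self.replica = self.replica, self.production

    @property
    def direction(self) -> str:
        """Replication always flows from production to the replica."""
        return f"{self.production} -> {self.replica}"


cg = ConsistencyGroup("CG1", production="SiteA", replica="SiteB")
cg.failover()   # steps 1-2: SiteB takes over production
cg.failover()   # steps 4-5: roles are reversed back to SiteA
```

The point of the sketch is that nothing changes roles on its own: the direction flips only when `failover()` is called, mirroring the user action described in the post.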

32 Posts

June 14th, 2016 08:00

Thanks Mohit, my question is more on how the production side knows that it is DR. When failover occurs, site A is down so the cluster in site A does not know it is now DR. What's the mechanism behind this when site A is back online? Does site A do a sync with site B when it's back online and then knows it is DR? There's no way to "mark" site A as DR if the cluster is offline during failover.

Thanks

13 Posts

June 17th, 2016 00:00

Hi,

I believe I didn't mention in my previous post that "production site is down" means the host or application is down, not the prod RP cluster.

If the prod RP cluster is still up during the disaster, failover and failback will not require a full sweep.

If both the production application and the RP cluster are unavailable, the image from DR is presented to the DR host to bring the application back up. When the production cluster is back online, a full sweep is needed to make the data consistent between both copies.

32 Posts

June 17th, 2016 05:00

Hi Mohit, good to know there is a full sweep, but still don’t quite understand the mechanism.

I’m just wondering how the prod cluster (site A) knows that it is no longer production when it’s online again so it will not attempt to replicate to site B since site B will be replicated to site A at this point (before failback). Does site A automatically try to sync with site B when it’s back online and sees that site B data is newer?

13 Posts

June 17th, 2016 06:00

Before production went down, the DR site was in no access mode. When production comes back up, it tries to sync with DR and finds that the settings have changed (DR is in image access or direct access mode). This is reported as a conflict: the user is presented with a dialog box stating that there is a settings conflict and asking which site has the correct settings.

At this point, the DR settings should be selected, and prod will sync up correctly.

If prod is selected in the settings conflict, DR image access will be disabled and prod will try to replicate in the original direction, which would be wrong.
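The conflict handling described here can be pictured as comparing two views of the same settings and letting the user pick one. This is a rough sketch under assumptions, not the product's logic: the `dr_access` key, the value strings, and the functions `has_conflict` and `resolve_conflict` are invented for illustration.

```python
def has_conflict(prod_view: dict, dr_view: dict) -> bool:
    """A conflict exists when the two sites disagree on the settings."""
    return prod_view != dr_view


def resolve_conflict(prod_view: dict, dr_view: dict, keep: str) -> dict:
    """Return the settings both sites end up with after the user's choice.

    keep must be "dr" or "prod" (hypothetical labels for the dialog options).
    """
    if keep not in ("dr", "prod"):
        raise ValueError("keep must be 'dr' or 'prod'")
    winner = dr_view if keep == "dr" else prod_view
    return dict(winner)


# Prod still believes DR is in no access mode; DR was switched to image access.
prod_view = {"dr_access": "no_access"}
dr_view = {"dr_access": "image_access"}

if has_conflict(prod_view, dr_view):
    # Keeping the DR settings preserves image access at DR.
    settings = resolve_conflict(prod_view, dr_view, keep="dr")
```

Choosing `keep="prod"` in this model would return `{"dr_access": "no_access"}`, which corresponds to the wrong outcome described above: image access at DR is dropped and replication resumes in the original direction.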

32 Posts

June 27th, 2016 21:00

Thanks as usual, Mohit.

Recently I discovered that completing a re-IP of a VM during test copy / failover / restore prod is a rather complicated process.

Questions:

1. Has the re-IP workflow improved at all in the recent GA of RP4VM, or is the process still the same, with the glue scripts downloaded to the source VM?

2. What is the difference between modifying IP settings on the source VM and on the copy VM in the RP4VM vSphere GUI?

3. Do we create a .bat file to load all the scripts in the glue package on a Windows host? The documentation is not very clear here.

4. If we make these registry changes to run scripts on a production source VM, will this impact the production VM at all upon reboots?
