Thanks Mohit. Just to be clear: we will have a full sweep (clearing of the journal), but the target VMDK will still be largely in sync with the production copy, and we will avoid a full initialization?
Digging deeper into the journal and datastore issue: if the array in question is XtremIO, can we theoretically create a single LUN/datastore just for journals? I know you said best practice is to "load balance," which I assume means keeping them with their source/target VMDKs, but to keep things organized on a system like XIO, can we create one LUN/datastore for everything?
"Load balance" was meant in terms of keeping the journal volumes distributed over multiple LUNs, thus avoiding a bottleneck at the journal LUN level. Every write IO from production leads to 5 R/W IOs on the target journal LUN, which becomes a performance bottleneck if all the journals sit on the same disks.
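To put rough numbers on that amplification, here is a small back-of-the-envelope sketch. The 5x multiplier is taken from the statement above; the function names and the even-spread assumption are mine, for illustration only:

```python
# Rough estimate of journal LUN load from the write amplification described
# above: each production write IO causes ~5 R/W IOs on the target journal LUN.

def journal_iops(production_write_iops: int, amplification: int = 5) -> int:
    """Estimate total journal-side IOPS for a given production write load."""
    return production_write_iops * amplification

def per_lun_iops(production_write_iops: int, journal_luns: int,
                 amplification: int = 5) -> float:
    """Assuming journals are spread evenly, each LUN sees 1/N of the load."""
    return journal_iops(production_write_iops, amplification) / journal_luns

print(per_lun_iops(2000, 1))  # single journal LUN: 10000.0 IOPS
print(per_lun_iops(2000, 4))  # spread over 4 LUNs: 2500.0 IOPS each
```

This is why a single journal LUN concentrates all the amplified IO on one set of disks, while spreading journals divides it.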
In RP4VM it is theoretically possible to create a single LUN/datastore for journals, but it is recommended to spread the journals out to avoid performance bottlenecks.
Whether it is XtremIO or any other array, a detailed check of the required IO specs would give a definitive answer to your question.
This Ask the Expert event has officially ended, but don't let that deter you from asking more questions. At this point our SMEs are still welcome to answer and continue the discussion, though they are not required to. This is where we ask our community members to chime in and assist other users if they can provide accurate information.
Many thanks to our SMEs who selflessly made themselves available to answer questions. We also appreciate our community members for taking part in the discussion and asking so many interesting questions.
ATE events are made for your benefit as members of ECN. If you're interested in pitching a topic or suggesting Subject Matter Experts, we would be interested in hearing from you. To learn more about what it takes to start an event, please visit our Ask the Expert Program Space on ECN.
2. What is snapshot granularity? Does this refer to how often a snapshot of the production VM is taken and replicated to the target journal? If so, what is the difference between this option and RPO?
Finally, can you go over the process of distribution in RP4VM for me? As I understand, production VM is “snapshotted” as in classic RP and this snapshot (delta of changes since last snapshot) is transferred to target journal and later distributed to target VMDK?
If you wish to understand RPO in Recoverpoint, you need to clearly understand what lag means and how it impacts the RPO.
Data lag on a link can be defined as the difference in data between the source and target sites.
RPO controls the difference in data between production and DR. The textbook definition says RPO is the amount of data the customer can afford to lose in case of a disaster at production. When we talk about RPO with respect to RP replication, we control RPO by defining the lag and application regulation settings.
If the customer wants a specific RPO to be enforced, application regulation can be enabled. When it is enabled, RP slows host applications as they approach the lag policy limit, guaranteeing the lag setting.
If application regulation is not desired, the RecoverPoint system will try to keep the difference between sites to a minimum by utilizing as much bandwidth and as many resources as needed.
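The regulation idea above can be sketched as follows. This is purely conceptual: the function name, thresholds, and linear scaling are my own assumptions, not RecoverPoint internals:

```python
# Conceptual sketch of application regulation: as the replication lag
# approaches the configured policy limit, host writes are delayed so the
# lag never exceeds the limit. Thresholds here are illustrative only.

def throttle_delay_ms(current_lag_mb: float, lag_limit_mb: float,
                      max_delay_ms: float = 50.0) -> float:
    """Return an artificial per-write delay that grows as lag nears the limit."""
    if current_lag_mb <= 0.8 * lag_limit_mb:
        return 0.0  # well under the limit: no regulation applied
    # scale the delay linearly over the last 20% of headroom, capped at max
    fraction = min(1.0, (current_lag_mb - 0.8 * lag_limit_mb) / (0.2 * lag_limit_mb))
    return fraction * max_delay_ms
```

With regulation disabled, the system instead does the opposite: it never delays the host and simply consumes whatever bandwidth it needs to shrink the lag.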
While talking about RPO I would like to bring up the concept of Sync and async replication.
In sync replication, the production host application initiates a write and then waits for an ACK from the remote RPA before initiating the next write, ensuring that the RPO value is always zero. In other words, application regulation is effectively always in force in sync replication.
In async replication, the host application initiates a write and does not wait for an ACK from the remote RPA before initiating the next write. The data of each write is stored in the local RPA and acknowledged at the local site. The RPA decides when to transfer the writes it holds to the replica storage, based on the lag (RPO) policy and system load/available resources.
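The two write paths can be contrasted in a minimal sketch. The function names and the queue standing in for the local RPA buffer are placeholders, not the product API:

```python
import queue

# Buffer standing in for the local RPA's write cache (async path).
replication_queue: "queue.Queue[bytes]" = queue.Queue()

acks = []  # record of who acknowledged each write, for illustration

def remote_ack(data: bytes) -> None:
    acks.append(("remote", data))

def local_ack(data: bytes) -> None:
    acks.append(("local", data))

def sync_write(data: bytes) -> None:
    """Sync: wait for the remote RPA's ACK before the app can issue the next
    write, so the data is already at the replica (zero RPO)."""
    remote_ack(data)

def async_write(data: bytes) -> None:
    """Async: buffer the write in the local RPA and acknowledge locally at
    once; transfer to the replica happens later per lag policy and load."""
    replication_queue.put(data)
    local_ack(data)
```

The trade-off is visible in who acknowledges: sync blocks the application on the remote site's latency, while async decouples the application from it at the cost of a non-zero lag.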
I would like to clarify an important point. We are not taking snapshots of production VM and replicating the difference to DR. We are intercepting every write I/O for the production VM and sending writes either sync or async to the DR.
If the RPO value is set too high, it will not meet the customer's RPO agreement; if the RPO value is set too low, RP will continually generate event messages indicating that the current lag exceeds the maximum permitted lag.
Snapshot granularity can be one of three types:
1. Per write
2. Per second
3. Dynamic
Per write effectively means sync replication and ensures zero RPO. Per second and dynamic apply to async replication: per second provides near-zero RPO, while dynamic means the system decides when and how much data needs to be replicated from source to target, trying to keep the lag under the permitted value.
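To summarize the mapping above in one place, here is a hypothetical lookup (the key names are mine, not product option strings):

```python
# Illustrative mapping of snapshot granularity to replication behavior,
# as described above. Key names are assumptions for the sketch.

def replication_mode(granularity: str) -> str:
    modes = {
        "per_write": "synchronous (zero RPO)",
        "per_second": "asynchronous (near-zero RPO)",
        "dynamic": "asynchronous (system keeps lag under the permitted value)",
    }
    return modes[granularity]
```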
Finally, distribution for RP4VM remains the same as in classic RecoverPoint. Please refer to the 5-phase, 3-phase, and 1-phase distribution sections in the RecoverPoint admin guide, as their explanation is beyond the scope of this ECN discussion. If you have any specific question regarding any phase with respect to RP4VM, please let me know.
Once again I would emphasize that in RP4VM we are not taking VMware snapshots of the prod VM and replicating the difference; the ESXi splitter on each host intercepts the prod VM's write IOs and sends them to the vRPA, which in turn replicates to the DR vRPAs.
Thanks Suden. As always, super helpful.
I did have another question about how RP4VM handles failure scenarios:
Let's say production site is down and DR becomes active (failover scenario).
What happens when production site is back online now and DR is still active? Do we initiate a failback? Will the production RP4VM assume identity as active cluster and try to replicate to DR? Or will an initial sync be done to determine that data on DR is newer than data on prod?
Take an example with SiteA as production, SiteB as DR, and a consistency group CG1 with a protected VM.
When production is down, a user action is required to initiate the failover wizard, which first enables image access so the image in question can be tested, and later completes the failover procedure.
Once failover completes successfully, SiteA becomes DR with no access to the original VMs, and the VM at SiteB runs production.
Now, when you say production is back online: from RecoverPoint's point of view, the SiteA VM is still a shadow VM with the .recoverpoint extension and has the role of a remote VM copy. Unless a user action is initiated to perform a failover again and make the SiteA VM production, the application at SiteA will not be live.
Here are the logical steps as per your question.
1. The production site (SiteA) is down.
2. Failover is initiated, which reverses the copy roles; SiteB becomes production and starts the application.
3. The application continues at SiteB until it is decided to make SiteA production again.
4. Failover is initiated again to reverse the roles.
5. Once failover completes, SiteA is production again.
When failover happens, the journal for the original production VM/copy is erased. No full init/sweep is required when failover is performed a second time, since a specific image is chosen to bring the application back at SiteA and replication simply reverses direction.
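The role reversal in the steps above can be sketched as a simple state change. This only models the roles and journal behavior described here; the dictionary shape and field names are illustrative, not the product's data model:

```python
# Conceptual model of a consistency group's failover: roles swap, the new
# replica's journal is erased, and no full sweep is needed on failback
# because replication merely reverses direction.

def failover(cg: dict) -> dict:
    """Swap production and replica roles and clear the new replica's journal."""
    cg = dict(cg)  # work on a copy; shallow copy is enough for this sketch
    cg["production"], cg["replica"] = cg["replica"], cg["production"]
    cg["journal"] = []               # journal of the original production copy is erased
    cg["full_sweep_needed"] = False  # a chosen image + direction reversal suffices
    return cg

cg1 = {"production": "SiteA", "replica": "SiteB", "journal": ["snap1", "snap2"]}
after_first = failover(cg1)           # SiteB becomes production
after_second = failover(after_first)  # failback: SiteA is production again
```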
Thanks Mohit, my question is more on how the production side knows that it is DR. When failover occurs, site A is down so the cluster in site A does not know it is now DR. What's the mechanism behind this when site A is back online? Does site A do a sync with site B when it's back online and then knows it is DR? There's no way to "mark" site A as DR if the cluster is offline during failover.