I would like to set up large journals for my CRR LUNs that are replicated by RecoverPoint. I would like a long protection window for these LUNs. There are two reasons I want to do this.
First so I can go back in time by something like 10 days.
Second I want to do long DR Tests. Again something like 10 days.
To facilitate long DR Tests I will also tune the RecoverPoint journals. I will increase the target side processing (TSP) allocation on the RP journal. This setting is called "Proportion of journal allocated for image access log" in the RP GUI.
The only small issue I can see with this scenario is a long journal lag after 10 days of DR Testing. I know that on average, even with a large journal, the journal is actively distributed to the CRR LUN when it is not mounted in image access mode. That is, the CRR LUN is not behind by the size of the journal.
Am I missing something? If I do this I will make the journals about 40% of the source LUN Size. I have 30 LUNs of average size 400GB. The LUNs store ESX Servers VMFS and associated virtual machines.
You raise a number of interesting questions to consider. Yes, you could use RecoverPoint to go back 10 days for extended target side processing. However, it may not always be a good idea.
First, let's discuss the image access modes. If you use virtual access (available with most splitters), you can get to any point in time almost immediately. The RPA and the splitter will reconstruct the replica LUN at that point in time without physically rolling it back, and present it to the host. Virtual access is intended for a "quick look" into a point in time. It is not designed for extended target side processing or heavy IO activity.
In your scenario, you will need to use logged (a.k.a. physical) access mode, where the replica is physically "played" back to a specific point in time. You will basically be undoing up to 10 days of production writes, which means we'll have to move a lot of data from the journals to the replica. Depending on how write-intensive your production is, 10 days' worth of writes may take a considerable amount of time to undo.
You could also use virtual access with roll, where you start in virtual access and eventually end in physical. However, you will need to wait for physical access to perform any intensive activity against the replica.
Once you are in physical access, you need to monitor two parameters: whether you have enough journal for TSP (20% by default) and whether you have enough journal (the other ~80%) to store all snapshots between the image access PiT and the latest production snapshot. For example, if you went back 10 days, but your journal can only store 12 days, you can only remain in image access for ~2 days.
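As a rough illustration of the second check, here is a minimal sketch. It assumes a steady production write rate, so that journal capacity can be expressed in days, and the function name is made up for this example (it is not part of RecoverPoint).

```python
def image_access_days_left(journal_window_days, rollback_days):
    """Days you can stay in image access before the snapshot area fills.

    journal_window_days: how many days of production writes the snapshot
        area of the journal can hold (assumed constant write rate).
    rollback_days: how far back the selected point in time is.
    """
    if rollback_days < 0 or rollback_days > journal_window_days:
        raise ValueError("rollback point is outside the journal window")
    # The journal already holds `rollback_days` of snapshots newer than the
    # image being accessed; only the remainder is free for new writes.
    return journal_window_days - rollback_days


# The example from the post: a 12-day journal, rolled back 10 days.
print(image_access_days_left(12, 10))  # 2
```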
Finally, while you are in image access, replication continues. The difference between the replica's PiT and the journal is called journal lag, and it will keep increasing. When you exit image access, you will have to "catch up". Again, depending on how write-intensive your production is, it may take more time to catch up to the latest image.
As you can see, the performance of journal LUNs is very important. It is a good idea to create multiple journal LUNs per replica copy (RP will stripe across them) and spread them across different Fibre Channel drives for best performance. Also, the journal size is less of a function of production size and more of a function of the production rate of change.
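To show why rate of change matters more than LUN size, here is a back-of-the-envelope sketch. It assumes a constant daily change rate and that only the non-TSP part of the journal holds snapshots. The function name and the 12.8 GB/day figure are hypothetical, and this is not the official sizing formula, so check the RecoverPoint documentation before sizing anything for real.

```python
def required_journal_gb(daily_change_gb, window_days, tsp_fraction=0.20):
    """Rough journal size needed for a given protection window.

    daily_change_gb: production writes per day (assumed constant).
    window_days: how many days of history the journal should hold.
    tsp_fraction: share of the journal reserved for the image access log.
    """
    if not 0 <= tsp_fraction < 1:
        raise ValueError("tsp_fraction must be in [0, 1)")
    snapshot_area_gb = daily_change_gb * window_days
    # The snapshot area is only (1 - tsp_fraction) of the whole journal.
    return snapshot_area_gb / (1.0 - tsp_fraction)


# Hypothetical LUN changing 12.8 GB/day, 10-day window, default 20% TSP.
print(round(required_journal_gb(12.8, 10), 1))
```

Under these assumptions, a 400GB LUN with a 40% (160GB) journal covers 10 days only if the LUN changes by no more than about 12.8GB per day. The LUN size itself does not appear in the calculation.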
We also recommend using snapshot consolidation for older snapshots to save journal space and speed up the roll back, because it is less likely that you will need microsecond granularity for anything older than a couple of days.
Having said all that, in many cases, if you plan to stay in image access for some serious testing, we recommend using array-based clones instead. The big advantages are: you are bypassing the splitter, you are not limited by the journal, and the journal lag is not growing.
Hope it helps.
Thanks for the great reply. You provide great information. I will flesh out my situation. We are using SnapView Clones for most of our Host Estate during DR Testing. Specifically, for virtual AIX (presented over NPIV, so like physical AIX from a storage viewpoint) and Physical Windows we have full independent SnapView Clones for prolonged DR Tests. With the clones, DR Tests can go on for as long as required. It is important to note these are full clones, not snaps.
However, these hosts are ESX Servers with Windows VMs as guests. We are also using VMware Site Recovery Manager (SRM). SRM doesn't like independent clones and it doesn't like having 2 x Storage Replication Adapters installed. SRM doesn't like independent clones because one of its features is failback. So in simple terms it wants to be able to manage both DR test / invocation and potential roll back.
I know I could have SRM invoked in test mode and then clone the SRM test mode LUNs, but this means a lot of the SRM work becomes manual. For example: import of the Clone LUNs, browse / import of the clones' VMX Files (VMs), and renaming of the cloned VMs in Virtual Centre to avoid conflict with the Production to CRR SRM jobs.
Now back to RecoverPoint. I have CLARiiON based Splitters. I will definitely increase the TSP setting by a reasonable amount. Suppose I have 10 days' worth of journals and I mount today. Do I now have room for roughly 9 to 10 days of Production Writes, which will get stored in the journals while the CRR LUN is frozen at my mount point?
This comes down to what happens to the distribution queues while the CRR LUN is mounted in image access mode. Also how the distribution queues work in general.
I am not worried about a long roll up of pending journal entries into the CRR LUN after a long DR Test. As long as RP can still protect production by replicating to the journals I am ok really. You could argue there is a long wait for roll up if I have a disaster during testing, but this is a risk I am willing to take.
Any further information much appreciated.
SRM test will always pick the latest image and use physical access mode (in the recent versions). In that case:
- you don't have to worry about roll back time, because you are initially selecting the latest image
- you still need to monitor if you have enough journal space for TSP (20% by default)
- you still need to monitor to make sure your journal lag does not exceed the journal capacity (~80%)
- you still need to wait to "catch up" after you are done with image access. (I would invest in enough physical spindles and create multiple journal LUNs for optimum performance)
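To put rough numbers on the two journal limits in the list above, here is a small sketch. It assumes a constant production change rate and that the journal lag can grow up to the whole non-TSP area. The function name, the 160GB journal and the 16GB/day change rate are hypothetical values for illustration only.

```python
def max_image_access_days(journal_gb, daily_change_gb, tsp_fraction=0.20):
    """Days of production writes the journal can absorb during image access.

    Assumes the replica is frozen at the latest image, so every production
    write adds to the journal lag until the non-TSP area is full.
    """
    if daily_change_gb <= 0:
        raise ValueError("daily_change_gb must be positive")
    if not 0 <= tsp_fraction < 1:
        raise ValueError("tsp_fraction must be in [0, 1)")
    lag_capacity_gb = journal_gb * (1.0 - tsp_fraction)
    return lag_capacity_gb / daily_change_gb


# Raising the TSP share leaves less room for journal lag.
for tsp in (0.20, 0.30, 0.40):
    days = max_image_access_days(160, 16, tsp)
    print(f"TSP {tsp:.0%}: about {days:.1f} days")
```

The point of the loop is the trade-off: every extra percent given to TSP is taken away from the area that absorbs production writes during the test. The sketch does not model how fast the TSP area itself fills, which depends on how much the test hosts write to the replica.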
In terms of RP internals, the distribution process is normally 5 phases (discussed in detail in the RecoverPoint Admin Guide). The data is first written to the journal (step 1) and then distributed to the replica (steps 2-5). When you enable image access, step 1 still occurs, but steps 2-5 are on hold until you disable image access. That's when we will "catch up" with the distribution.
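The effect of putting steps 2-5 on hold can be sketched with a toy model. This is not how RecoverPoint is implemented. It simply assumes a constant production change rate and a constant distribution rate (both hypothetical numbers) to show how lag builds up during image access and drains afterwards.

```python
def simulate_lag(days_in_image_access, daily_change_gb, distribute_gb_per_day):
    """Return (lag in GB when image access ends, days needed to catch up)."""
    # Step 1 (write to the journal) continues during image access, while
    # steps 2-5 (distribution to the replica) are on hold, so every
    # production write adds to the lag.
    lag_gb = days_in_image_access * daily_change_gb

    # After image access is disabled, distribution resumes, but production
    # keeps writing, so the lag only drains at the net rate.
    net_drain_gb_per_day = distribute_gb_per_day - daily_change_gb
    if net_drain_gb_per_day <= 0:
        raise ValueError("distribution must outpace production to catch up")
    return lag_gb, lag_gb / net_drain_gb_per_day


# Hypothetical: 10-day test, 10 GB/day of changes, 50 GB/day distribution.
lag, catch_up = simulate_lag(10, 10, 50)
print(lag, catch_up)  # 100 2.5
```

The distribution rate depends heavily on journal and replica spindle performance, which is why multiple journal LUNs on enough physical disks shorten the catch-up.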
So, if you size everything correctly (journal snapshot area, journal TSP area and journal performance) and monitor all of the above, you may be able to do it.
Getting back to array clones idea, here is how it can work:
1. Enable physical image access
2. Take an array-based clone
3. Disable image access
4. Present the clone to ESX
5. Mount data stores / start VMs, etc.
We have a GUI called Replication Manager, which can automate all of the above in a single job. You could potentially save on journal space at the expense of the clone space, which could actually result in net savings. Plus, no worrying about journal capacity and journal lag. However, you would need to purchase / learn / configure Replication Manager. Just another option to consider.