I found this post when searching Powerlink for the same issue that Dave reported. I have four CGs for one SQL database. Yesterday I enabled Image Access (logged) and mounted the replicas on a backup host to perform a backup of the database at our DR site. The backup started as I was leaving, and this morning when I checked on the status of RP I noticed that those four groups (and no others) were in 'Regulated' mode even though I do not have that option enabled. After disabling image access (the backup had completed), the 'Regulated' state went away, but I then noticed that all the groups were 'undoing writes'. So it appears my DBA wrote the SQL backup file to the replicas, and this caused the CGs to go into a Regulated state?
I could not find much information on the Regulated state and what the impact is to the production hosts or replication. Anyone else experience this?
The “Allow Regulation” check box under the Consistency Group Protection Policy has nothing to do with a Consistency Group being placed in regulated state. I know this is a bit confusing, so let me explain the difference.
The “Allow Regulation” check box in the policy tells RecoverPoint that, in the event it cannot replicate the data being changed on the LUNs in a consistency group, it should not allow the host to continue, i.e., no acknowledgement of the write completion unless RecoverPoint has recorded the change. We typically do not recommend this setting; it is there for customers who would rather impact application performance than risk replication lag due to bandwidth constraints in some component of the replication infrastructure.
In RP 3.1 we introduced a protection mechanism called Control Action Regulation. In the rare case that a consistency group copy is cycling in and out of states (highload, paused, init, active, back to highload) or otherwise operating improperly, with the potential of impacting other consistency groups, the copy will be placed in “Regulated” state, in which the system protects itself by not taking any control action for that copy for 24 hours. This feature helps limit the adverse effect an improperly operating copy can have on other copies.
Reading the description of your scenario, you may be experiencing a bug that was in the initial release of Control Action Regulation, which can affect all the CGs, not just the regulated ones. A fix has been released for this; see Primus emc222454 for a workaround (the unregulate_all_copies command from the CLI prompt) for versions prior to 3.2.2.
Awesome information, thanks for the reply, that was very helpful. I will take a look at the article and apply it to our environment.
Thanks for the good explanation about "Regulated".
Can you tell me why a CG would enter High Load mode in CDP? It's supposed to be synchronous replication, and High Load means that replication is behind. Since it's an FC link to both the source and the replica, why do some of the CGs enter High Load, and what can I do to remedy this?
By the way, I am working on scheduling an upgrade to the latest software.
I think that has to do with the I/O load between the production host and the production LUNs. If the RPA cannot keep up with the I/O that the production host is generating to the production LUNs, then you would see this message.
AranH is correct. The RecoverPoint appliances are rated at a sustained change rate of 75MB/s today, regardless of the media used for replication. If the amount of writes exceeds this for a sustained period on a per-RPA basis, then you would see a highload, which simply means that the RPA cannot keep up, so it reverts to marking mode (tracking the changes, but not replicating them); when the high change rate dissipates, the RPA uses the marking data to catch up.
Thanks Rick, that sheds some light, but it's a limitation that's not stated in the data sheets or anywhere else I could find. I bought 17TB worth of CDP licenses so I could do synchronous replication between two local data centers, and that's a pretty severe limitation.
I have an SR open on this that's been escalated to engineering, haven't heard back from them yet.
The performance characteristics are published in the Release Notes.
Just to make sure we are not miscommunicating and you have the total picture: the 17TB replication license is not tied to the change rate, but to the amount of usable storage, as seen by the host, that needs to be protected. The 75MB/s per RPA is a resource limit by design, much like an ESX server only has a given amount of memory or CPU capacity, so you scale out by adding more ESX servers; same for RP, you scale out by adding RPAs. You can have up to 8 RPAs per site, which means the sustained change rate that can be supported is 600MB/s. This equates to approximately 2.1TB per hour per RP cluster; beyond that you scale out to multiple clusters, again much like you would for ESX farms.
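The scaling arithmetic above can be sketched quickly. This is just a back-of-the-envelope check using the figures quoted in this thread (75MB/s per RPA, 8 RPAs per site); the TB/hour conversion uses binary (1024-based) units, which reproduces the ~2.1TB/hour figure:

```python
# Back-of-the-envelope RecoverPoint cluster throughput, per the figures above.
PER_RPA_MBPS = 75            # quoted sustained change rate per appliance, MB/s
MAX_RPAS_PER_SITE = 8        # quoted maximum RPAs per site

cluster_mbps = PER_RPA_MBPS * MAX_RPAS_PER_SITE   # 600 MB/s per cluster
tb_per_hour = cluster_mbps * 3600 / 1024 ** 2     # MB/s -> TB/hour (binary TB)

print(f"{cluster_mbps} MB/s sustained per cluster")   # 600 MB/s
print(f"~{tb_per_hour:.1f} TB per hour")              # ~2.1 TB per hour
```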
What is the SR#?
Can I ask what your RPO/RTO's are?
How long does the system stay in highload?
Have you run detect_bottlenecks command to get an idea of what is triggering the highload?
I noticed you are using VMware. If you plan to overcommit memory, do you follow the best practice of dedicating non-replicated devices for all VM swap space?
The reason I ask is that the swap files are zeroed out any time you power on or cycle a VM, which means a VM with 4GB of RAM causes 4GB of change at every power cycle or new VM creation. The swap file is of no use for DR, so it does not need to be replicated; as a best practice, VMware recommends not replicating it and dedicating a device per ESX server for swap space.
I find nothing in the release notes about any performance characteristics or limitations, and nothing about a 75MB/s limit.
SR# 32640208; all logs and reports have been uploaded there. I received a fairly unhelpful response from engineering, relayed through the support engineer.
The CG that is the problem is a large SQL Server database: a 2TB data LUN and a 300GB log LUN. It does have heavy I/O.
It is the main database that runs much of our business. RPO and RTO is minutes.
The bottlenecks report indicates:
on box: RPA1
Incoming writes rate for link : 5.81645 Megabytes/sec
Peak Value: 82.4774 Megabytes/sec
The CG almost never completes init; it remains in High Load mode and cycles between 0% and 10% complete. Given that the peak write rate on the box is 82MB/s, it would seem that I'm still well within the 75MB/s sustained limit. In the last couple of weeks, when it did complete initialization, it went back to High Load fairly quickly.
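The distinction the report is drawing is between the sustained (average) write rate and momentary peaks, and only the sustained figure is compared against the 75MB/s limit. A minimal sketch of that comparison, using made-up sample values rather than actual detect_bottlenecks output:

```python
# Sustained vs. peak write rate from sampled throughput readings.
# Samples are hypothetical MB/s values at regular intervals, loosely
# shaped like the report above (low average, occasional high bursts).
samples = [5.8, 6.2, 82.5, 7.1, 5.9, 6.0, 80.0, 6.3]

peak = max(samples)                      # a brief burst
sustained = sum(samples) / len(samples)  # what counts against 75 MB/s

print(f"peak: {peak} MB/s, sustained: {sustained:.1f} MB/s")
print("over the sustained limit" if sustained > 75 else "within the sustained limit")
```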
Reviewing the real-time performance graphs available in the RP management app shows a high level of writes, but CPU utilization on RPA1 is under 10%, with application traffic averaging 250Mb/s and bursting to 500-600Mb/s.
We're working on evaluating the VMware workload; however, the guests are all active servers and are rarely shut down or rebooted.
When we bought the CDP licenses, the possibility of needing additional RPAs never came up; there was no indication that there could be a performance limitation, and given that CPU utilization on all of them was so low, it didn't seem that would be a problem.
I do have a problem, however, with the notion of having to buy additional RPAs to handle this local CDP, and then being forced to buy matching RPAs for the DR site, which clearly doesn't need them.
I'm going to try moving everything except this one CG to the other RPA and see if it starts making progress on the initialization.