August 31st, 2009 16:00

Why does RP enter "Regulated" role on source during CDP

I have RP 3.1 SP1, using the CLARiiON splitter, on six VMware volumes of 500GB each, each in its own consistency group. I am replicating from a CX4-480 to a CX3-40; the journal files are 30GB.

I have not checked "Allow Regulation" in the policies, but several of the CGs are in Regulated mode and are also sometimes going into high-load mode.

I get error reports 4133 ("Copy regulation has started") and 12034 ("Writes to storage may have occurred without corresponding writes to RPA"), as well as ERROR & CLEAR pairs of 16034 ("Writes to storage may have occurred without corresponding writes to RPA. Problem has been corrected").

Can someone explain what's happening?

Thanks,
Dave

2.2K Posts

November 25th, 2009 08:00

I found this post when searching Powerlink for the same issue Dave reported. I have four CGs for one SQL database. Yesterday I enabled Image Access (logged) and mounted the replicas on a backup host to perform a backup of the database at our DR site. The backup started when I left, and this morning, when I checked on the status of RP, I noticed that those four groups (and no others) were in 'Regulated' mode even though I do not have that option enabled. After disabling image access (the backup had completed), the 'Regulated' state went away, but I then noticed that all the groups were 'undoing writes'. So it appears my DBA wrote the SQL backup file to the replicas, and this caused the CGs to go into a Regulated state?

I could not find much information on the Regulated state and what the impact is to the production hosts or replication. Anyone else experience this?

Aran

117 Posts

November 25th, 2009 19:00

The "Allow Regulation" check box under the Consistency Group Protection Policy has nothing to do with a consistency group being placed in the regulated state. I know this is a bit confusing, so let me explain the difference.

The "Allow Regulation" check box in the policy tells RecoverPoint that, in the event RecoverPoint cannot replicate the data being changed on the LUNs in a consistency group, it should not allow the host to continue, i.e., no acknowledgement of write completion unless RecoverPoint has recorded the change. We typically do not recommend this setting; it is there for customers who would rather impact application performance than accept the possibility of replication lag due to bandwidth constraints on some component in the replication infrastructure.

In RP 3.1 we introduced a protection mechanism called control action regulation. In the rare case that a consistency group copy is cycling in and out of states (highload, paused, init, active, back to highload) or otherwise operating improperly in the system, with the potential of impacting other consistency groups, the copy will be placed in the "Regulated" state, in which the system protects itself by not taking any control action for that copy for 24 hours. This feature helps limit the adverse effect of an improperly operating copy on other copies.

Reading the description of your scenario, you may be experiencing a bug that was in the initial release of control action regulation and can affect all the CGs, not just the regulated ones. A fix has been released for this; see Primus emc222454 for a workaround (the unregulate_all_copies command from the CLI prompt) for versions prior to 3.2.2.

-rick

2.2K Posts

November 30th, 2009 08:00

Rick,

Awesome information, thanks for the reply, that was very helpful. I will take a look at the article and apply it to our environment.

Thanks,

Aran

9 Posts

December 11th, 2009 08:00

Thanks for the good explanation about "Regulated".

Can you tell me why a CG would enter High Load mode in CDP? It's supposed to be synchronous replication, and High Load means that replication is behind. Since it's an FC link to both the source and the replica, why do some of the CGs enter High Load, and what can I do to remedy this?

By the way, I am working on scheduling an upgrade to the latest software.

Dave

2.2K Posts

December 11th, 2009 11:00

Dave,

I think that has to do with the I/O load between the production host and the production LUNs. If the RPA cannot keep up with the I/O that the production host is generating to the production LUNs, then you would see this message.

Aran

9 Posts

December 29th, 2009 12:00

Thanks, Rick; that sheds some light, but it's a limitation that's not stated in the data sheets or anywhere else I could find. I bought 17TB worth of CDP licenses so I could do synchronous replication between two local data centers, and that's a pretty severe limitation.

I have an SR open on this that's been escalated to engineering, haven't heard back from them yet.

Thanks,

Dave

117 Posts

December 29th, 2009 12:00

AranH is correct. The RecoverPoint appliances are rated at a sustained change rate of 75MB/s today, regardless of the media used for replication. If the amount of writes exceeds this for a sustained period on a per-RPA basis, then you would see a highload, which simply means that the RPA cannot keep up, so it reverts to marking mode (tracking the changes, but not replicating them); then, when the high change rate dissipates, the RPA uses the marking data to catch up.
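The highload/marking-mode behavior Rick describes can be sketched as a simple backlog model. This is an illustrative simulation only, not actual RecoverPoint internals; the 75 MB/s figure comes from the thread, while the workload samples are hypothetical.

```python
# Toy model: an RPA drains writes at a fixed sustained rate (75 MB/s here).
# When incoming writes exceed that rate for long enough, a replication
# backlog builds and the group is effectively in "marking mode" (changes
# tracked, not replicated); once the burst subsides, the backlog drains.

RPA_SUSTAINED_MBPS = 75.0

def simulate(write_rates_mbps):
    """Track replication backlog (MB) over per-second write-rate samples.

    Returns (final_backlog_mb, lagging) where lagging indicates whether
    the RPA is still behind at the end of the trace.
    """
    backlog = 0.0
    for rate in write_rates_mbps:
        backlog += rate - RPA_SUSTAINED_MBPS  # net growth (or drain) of lag
        backlog = max(backlog, 0.0)           # can't be "ahead" of the host
    return backlog, backlog > 0.0

# A 60 s burst at 100 MB/s builds 25 MB/s * 60 s = 1500 MB of lag:
lag, lagging = simulate([100.0] * 60)          # -> (1500.0, True)

# Followed by 60 s at 50 MB/s (25 MB/s of headroom), the lag fully drains:
lag2, lagging2 = simulate([100.0] * 60 + [50.0] * 60)  # -> (0.0, False)
```

The key point of the model is that highload is driven by *sustained* rate versus the per-RPA limit, not by momentary peaks alone.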

-rick

117 Posts

December 29th, 2009 16:00

David,

The performance characteristics are published in the Release Notes.

Just to make sure we are not miscommunicating and you have the total picture: the 17TB replication license is not tied to the change rate, but to the amount of usable storage, as seen by the host, that needs to be protected. The 75MB/s per RPA is a resource limit by design, much like an ESX server only has a given amount of memory or CPU capacity, so you scale out by adding more ESX servers; the same goes for RP, where you scale out by adding RPAs. You can have up to 8 RPAs per site, which means the sustained change rate that can be supported is 600MB/s. This equates to approximately 2.1TB per hour per RP cluster; beyond that you scale out to multiple clusters, again much like you would for ESX farms.
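The scale-out arithmetic above works out as follows; the 75MB/s and 8-RPA figures are from the post, and the conversion assumes decimal units (1 TB = 1e6 MB):

```python
# Per-site sustained throughput from scaling out RPAs, per Rick's figures.
PER_RPA_MBPS = 75          # rated sustained change rate per appliance
MAX_RPAS_PER_SITE = 8      # maximum RPAs in one RP cluster

cluster_mbps = PER_RPA_MBPS * MAX_RPAS_PER_SITE   # 600 MB/s per cluster
tb_per_hour = cluster_mbps * 3600 / 1e6           # 2.16 TB/hour, i.e. "appx 2.1TB"
```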

What is the SR#?

Can I ask what your RPO/RTO's are?

How long does the system stay in highload?

Have you run detect_bottlenecks command to get an idea of what is triggering the highload?

I noticed you are using VMware. Do you follow the best practice of dedicating non-replicated VMware swap-file devices for all VM swap space, if you plan to overcommit memory?

The reason I ask is that the swap files are zeroed out any time you power on or cycle a VM, which means that for a VM with 4GB of RAM, every power cycle or new VM creation causes 4GB of change. The swap file is of no use for DR, so it does not need to be replicated; as a best practice, VMware recommends not replicating it and dedicating a device per ESX server for swap space use.
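A back-of-the-envelope estimate of the churn Rick describes: each power cycle rewrites a swap file the size of the VM's RAM, and that change gets replicated despite having no DR value. The 4GB figure is from the post; the VM count and cycle rate below are hypothetical.

```python
# Estimate replicated change (GB) generated purely by swap-file re-zeroing
# on VM power cycles, assuming swap files live on replicated devices.

def swap_churn_gb(vm_ram_gb, power_cycles):
    """GB of avoidable replication: RAM-sized swap rewrite per power cycle."""
    return vm_ram_gb * power_cycles

# e.g. ten VMs with 4 GB of RAM each, rebooted once in a week:
weekly_gb = swap_churn_gb(4, 10)   # 40 GB/week of avoidable replication
```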

-rick

9 Posts

December 30th, 2009 07:00

Rick,

I find nothing in the release notes about any performance characteristics or limitations, and nothing about a 75MB/s limit.


SR# 32640208; all logs and reports have been uploaded there. I received a fairly unhelpful response from engineering, relayed through the support engineer.


The CG that is the problem is a large SQL Server database: a 2TB data LUN and a 300GB log LUN. It does have heavy I/O.

It is the main database that runs much of our business. RPO and RTO is minutes.


The bottlenecks report indicates:


on box: RPA1

Incoming writes rate for link           : 5.81645 Megabytes/sec

                              Peak Value: 82.4774 Megabytes/sec


The CG almost never completes init; it remains in high load mode and cycles between 0% and 10% complete. Given that the peak write rate on the box is 82MB/s, it would seem that I'm still well within the 75MB/s sustained limit. In the last couple of weeks, when it did complete initialization, it went back to high load fairly quickly.


Reviewing the real-time performance graphs available in the RP management app shows a high level of writes, but CPU utilization on RPA1 is under 10%, with application traffic averaging 250Mb/s and bursts to 500-600Mb/s.
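One thing worth checking with these graph numbers: if the application-traffic figures are in megabits per second (the lowercase "b" in Mb/s suggests so) while the RPA rating is in megabytes per second, then converting units puts the bursts right at the rated limit. This is just the unit arithmetic, under that assumption about the graph's units:

```python
# Convert the graph's megabit/s readings to megabytes/s for comparison
# against the 75 MB/s per-RPA sustained rating (8 bits per byte).

def mbit_to_mbyte(mbit_per_s):
    return mbit_per_s / 8.0

avg = mbit_to_mbyte(250)        # 31.25 MB/s average -- comfortable
burst_hi = mbit_to_mbyte(600)   # 75.0 MB/s -- exactly at the rated limit
```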

We're working on evaluating the VMware workload; however, the guests are all active servers and rarely shut down or reboot.

When we bought the CDP licenses, the possibility of needing additional RPAs never came up, and there was no indication that there could be a performance limitation; given that CPU utilization on all of them was so low, it didn't seem that would be a problem.

I do have a problem, however, with the notion of having to buy additional RPAs to handle this local CDP, and then being forced to buy matching RPAs for the DR site which clearly doesn't need them.

I'm going to try moving everything except this one CG to the other RPA and see if it starts making progress on the initialization.

Thanks,
Dave


2.2K Posts

December 30th, 2009 08:00

David,

In all the release notes there is a section titled Performance with a table listing the "Average Sustained RPA Throughput MB/s". That is where the 75MB/s figure is listed.

It sounds like a pretty frustrating implementation you are working on. I have been using RP for almost three years now, replicating only SQL Server databases. We had a lot of challenges initially with performance, and it took a bit of work to finally find a configuration that worked for our SQL CGs. The databases we are replicating are OLTP-type databases with fairly high IOPS and change rates; one of them is almost 3TB in size. What we found helped with replication throughput was splitting the data and log LUNs across multiple CGs and then creating a Group Set to keep all the CGs in consistent sync. I know you stated that the data LUN is 2TB in size, but is it possible to work with the DBAs to split the data files across multiple LUNs? A SQL performance best practice is to have more, smaller LUNs supporting the data and logs to provide increased throughput. This would also enable you to use multiple CGs, which would provide more throughput. If this is not possible, then you could at least separate the DB and log LUNs into separate CGs and create a Group Set for them. This may help with the init and peak loads.
