VTL Pool lagging replication issues

Question

Hello, I have two identical Data Domain systems, one of which receives backups from an AS400 via VTL. The Site A VTL pool is replicated to Site B, which is working fine.

We started with a DEV system backup and it replicated to Site B. We got replication lag warnings etc., but eventually it finished.

As the DEV and PRD AS400 boxes contain about the same information (DEV is a replica of PRD), we then enabled backups from the PRD box to hit the DD. To my surprise, we are experiencing the same lag issues and very long replication times.

The VTL is configured as LTO 5, so I think each full PRD backup is less than 3TB, but the DD units don't seem to send only the deltas; instead they are doing complete replications.

I know DD Boost is not supported with VTL, but I would have expected two identical DD units to be "smart" and only replicate what is needed. Each full backup adds about the same number of GB to replicate as the previous one.

There are no network issues that we know of, and we are currently running OS 5.7.10.

Any ideas?

James_Ford · Answer

Hi Paul,

So from the screen shot above you are using pool based replication (which is mtree replication under the covers). Mtree replication works as follows (this is a very basic overview):

- Takes mtree snapshots on source and destination

- Compares snapshots on source and destination to work out the differences (i.e. which files are different on the source and therefore need to be replicated to the destination)

- Replicates the differences from source -> destination such that the snapshot on the source can be built on the destination

- Once replication is complete and the mtree snapshot from the source built on the destination, the destination 'exposes' the new snapshot (i.e. its contents are used to replace the contents of the mtree on the destination)

As mentioned above snapshot comparison is used to work out which files need to be replicated. For each file the source will:

- Read the file to work out what data it contains

- Ask the destination if it already has a copy of this data

- Physically replicate any data which the destination doesn't already have

As a result, mtree replication is 'de-dupe aware' and will only send data that the destination doesn't already have, rather than performing 'full' replication of any new files.
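The flow described above can be sketched roughly as follows. This is only a toy model to illustrate the idea, not how DD OS is actually implemented: the names `fingerprint`, `changed_files` and `replicate`, the use of SHA-1, and the dict-based "snapshots" and "segment store" are all assumptions made up for the example.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Hypothetical segment fingerprint (SHA-1 is chosen only for the example)."""
    return hashlib.sha1(segment).hexdigest()


def changed_files(source_snapshot: dict, dest_snapshot: dict) -> list:
    """Snapshot comparison: files that are new or different on the source."""
    return sorted(
        name
        for name, segments in source_snapshot.items()
        if dest_snapshot.get(name) != segments
    )


def replicate(source_snapshot: dict, dest_snapshot: dict, dest_store: dict) -> tuple:
    """Replicate changed files, sending only segments the destination lacks.

    Returns (logical_bytes, physical_bytes): how much data the changed files
    contain versus how much actually had to be sent.
    """
    logical = 0
    physical = 0
    for name in changed_files(source_snapshot, dest_snapshot):
        segments = source_snapshot[name]
        for segment in segments:
            logical += len(segment)
            fp = fingerprint(segment)
            # "Ask the destination if it already has a copy of this data"
            if fp not in dest_store:
                dest_store[fp] = segment
                physical += len(segment)
        # The destination 'exposes' the file once all of its data is present
        dest_snapshot[name] = list(segments)
    return logical, physical


dest_snapshot, dest_store = {}, {}
source = {"full_1": [b"A" * 100, b"B" * 100]}
print(replicate(source, dest_snapshot, dest_store))  # (200, 200)

# A second full backup that mostly repeats the first one
source["full_2"] = [b"A" * 100, b"B" * 100, b"C" * 100]
print(replicate(source, dest_snapshot, dest_store))  # (300, 100)
```

In the second run the new file is 300 bytes of logical data, but only the 100 bytes the destination has never seen are sent.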

Note that:

- Replication contexts don't report how much unique/de-duplicated data they have left to send; instead they report how much pre-compressed (i.e. logical/unhydrated) data they have left to send. In the screen shot above the first context shows ~2981GB of pre-compressed data remaining, but because replication is de-dupe aware it will not need to send 2981GB of physical data over the wire

- To further illustrate this, the replication context reports that it is achieving a compression ratio of ~19x (meaning that it is only sending 1KB of physical data for every ~19KB of logical data)
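As a rough worked example using the two figures quoted above: dividing the pre-compressed amount remaining by the reported compression ratio gives a ballpark for the physical data still to be sent. The helper name `physical_remaining` is made up for this sketch, and it assumes the ~19x ratio holds for the rest of the transfer, which it may not.

```python
def physical_remaining(pre_comp_remaining: float, compression_ratio: float) -> float:
    """Estimate physical data left to send, in the same unit as the input.

    Assumes the reported ratio stays constant for the remaining data,
    so treat the result as an estimate only.
    """
    return pre_comp_remaining / compression_ratio


estimate = physical_remaining(2981, 19)
print(f"~{estimate:.0f} physical for 2981 pre-compressed")  # ~157
```

So the context in the screen shot would be expected to send something on the order of 157 rather than 2981 over the wire, if the ratio holds.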

If replication is slow/lagging then the main reasons for this are generally:

- A lack of network bandwidth between the DDRs (or bandwidth throttling being imposed). I notice that you have low bandwidth optimisation enabled, so I imagine that you do not have good bandwidth between the systems, as this option is only designed for connections of < 6Mbps

- Poor read performance on the source DDR: the source DDR needs to read each file to work out what data it references, so poor read performance (which can be caused by a number of factors) can cause poor replication performance

If you would like this looked at more closely/explained in more detail I would recommend opening a support case.

Thanks, James

jmbaxt1 · Answer

We have a very similar scenario: AS400 using VTL and replication between 2 sites. Did you end up discovering something that helped improve replication?
We have a 50MB VPN tunnel but it's taken 3 days to replicate the 65GB of post-comp data. Also, have you taken a look at your iostat and system performance while replication is running? For some reason we're only pushing out 900kb/s.
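To put those numbers side by side, here is a quick back-of-the-envelope calculation. The units in the post are ambiguous, so this sketch assumes "50MB" means a 50 Mbit/s link, "900kb/s" means 900 kilobytes per second, and decimal units throughout; `transfer_hours` is just a helper name for the example.

```python
GB = 10**9


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


size = 65 * GB                   # post-comp data to replicate
link_rate = 50e6 / 8             # assumed 50 Mbit/s link, in bytes per second
observed_rate = 900e3            # assumed 900 kilobytes per second
three_days = 3 * 24 * 3600       # seconds

print(f"At full link rate: {transfer_hours(size, link_rate):.1f} h")
print(f"At observed rate:  {transfer_hours(size, observed_rate):.1f} h")
print(f"Average over 3 days: {size / three_days / 1e3:.0f} KB/s")
```

Under these assumptions 65GB should take a few hours at line rate and under a day even at 900 KB/s, so three days points to an average rate well below either figure.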

Ryan_Johnson · Answer

You could confirm there aren't any Data Domain replication throttles set. If one is set, it will limit the bandwidth replication can use.

Also, you could run iperf between the two Data Domains to confirm the expected bandwidth is there. Even when a VPN tunnel has a certain bandwidth, it is not always dedicated to Data Domain replication. iperf is built into the Data Domain and is run with the "net iperf client" and "net iperf server" commands. If iperf doesn't go as fast as you expect, it's probably a network issue and not the Data Domain.
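The checks suggested in this thread can be summarised as a small triage helper. This is only a sketch of the reasoning, not a Data Domain tool: the function name `suspects`, its parameters and the 0.8 tolerance are all hypothetical, and the throughput figures would have to be read off the throttle settings, the iperf run and the replication statistics by hand.

```python
def suspects(expected_mbps, iperf_mbps, replication_mbps,
             throttle_set, tolerance=0.8):
    """Return a list of likely causes of slow replication.

    expected_mbps    -- bandwidth the link is supposed to provide
    iperf_mbps       -- throughput measured with iperf between the systems
    replication_mbps -- throughput replication is actually achieving
    throttle_set     -- True if a replication throttle is configured
    tolerance        -- hypothetical fraction below which a figure counts as "low"
    """
    causes = []
    if throttle_set:
        causes.append("replication throttle")
    if iperf_mbps < tolerance * expected_mbps:
        # iperf itself is slow, so the network is the likely bottleneck
        causes.append("network")
    elif replication_mbps < tolerance * iperf_mbps:
        # the network can go faster than replication is going
        causes.append("source read performance or other Data Domain factor")
    return causes


# Hypothetical figures: link should do 50 Mbit/s, iperf only reaches 7.2
print(suspects(50, 7.2, 7.0, throttle_set=False))  # ['network']
```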
