In our example the source directory is /user/test1. Because the source is native HDFS, we can enable snapshots on the directory to be replicated; Cloudera Manager then automatically detects that the directory is snapshottable and uses a snapshot as the source of the replication.
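If you prefer the command line to the Cloudera Manager HDFS file browser, enabling snapshots on the directory is a one-line operation (a minimal sketch using the example path; run as an HDFS superuser):
$ sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/test1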
Review the directory with the HDFS file browser in Cloudera Manager
In our example, we use a local user to generate some test data; a corresponding user with the same uid and gid membership exists on Isilon (this could also be an LDAP user).
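A minimal sketch of preparing such a user on the DAS cluster, assuming a hypothetical test1 user with uid/gid 1001 (match the same values on the Isilon access zone or in LDAP):
$ sudo groupadd -g 1001 test1
$ sudo useradd -u 1001 -g 1001 -m test1
$ sudo -u hdfs hdfs dfs -mkdir -p /user/test1
$ sudo -u hdfs hdfs dfs -chown test1:test1 /user/test1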
$ su - test1
$ cd /opt/cloudera/parcels/CDH/jars
$ yarn jar hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar teragen 1000000 /user/test1/gen1
$ yarn jar hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar terasort /user/test1/gen1 /user/test1/sort1
$ yarn jar hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar teravalidate /user/test1/sort1 /user/test1/validate1
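To confirm the test data landed where expected before setting up replication, list the directory; the gen1, sort1, and validate1 subdirectories should appear:
$ hdfs dfs -ls /user/test1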
Now let's set up replication of this data from the DAS cluster to Isilon:
1. Configure a Replication Peer for the source (DAS) cluster: select Peers from the Backup tab in the Isilon Cloudera Manager
2. Add a Peer
3. Name the Peer; in this example we use 'DAS' to keep it simple. Add the peer URL and the credentials used to log on to the DAS Cloudera Manager
4. The Peer is validated as connected
5. Now, let's create an HDFS Replication Schedule from the Backup menu
6. From the drop-downs, select the source 'DAS' cluster, the source path, the destination 'Isilon' cluster, and the destination path to replicate to:
A schedule can be set as needed; we select daily at 00:00 PDT
We run this job as hdfs. Since we wish to replicate the source permissions, the Run As User must have superuser privileges on the target cluster; if Kerberos is in use, additional steps must be completed so that the Run As User can authenticate successfully against the target cluster.
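A quick hedged sanity check before scheduling the job: from a node on the target (Isilon) cluster, confirm that the hdfs user can create and browse the destination path (the /DAS path matches our example; with Kerberos enabled you would first need a valid ticket for the hdfs principal):
$ sudo -u hdfs hdfs dfs -mkdir -p /DAS
$ sudo -u hdfs hdfs dfs -ls /DAS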
Select the Advanced Tab
Select 'Skip Checksum Checks' -- this must be done, otherwise replication will fail: OneFS does not implement the HDFS block-level checksums that the checksum comparison relies on
Additional settings specific to your environment and requirements can also be configured
7. The replication policy is now available
8. Before executing a data copy, we can perform a dry run to validate and evaluate the replication policy.
9. After a successful dry run, the job can be run manually, or you can wait for the scheduled run to copy the data
Review the job on completion; the distcp details and options can be seen, along with other information about the job
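For reference, the copy that BDR drives under the covers is roughly equivalent to a standalone distcp such as the sketch below; the NameNode and SmartConnect hostnames are hypothetical placeholders, and the exact options Cloudera Manager passes are visible in the job details:
$ hadoop distcp -update -skipcrccheck -pugp \
    hdfs://das-namenode:8020/user/test1 \
    hdfs://isilon-smartconnect:8020/DAS/user/test1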
10. Compare the Source and Target directories; we see the data has been replicated maintaining permissions.
Source DAS cluster - /user/test1
Target Isilon cluster - /DAS/user/test1
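The same comparison can be made from the command line; a short hedged sketch, running the first command on the DAS cluster and the second on the Isilon cluster, adjusting paths to your layout:
$ hdfs dfs -ls -R /user/test1        # on the DAS (source) cluster
$ hdfs dfs -ls -R /DAS/user/test1    # on the Isilon (target) cluster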
HDFS replication is also incremental aware: subsequent runs copy only new or changed data.
1. Add new data to DAS - /user/test1 - gen2, sort2, validate2, tpcds (a sketch of the commands used appears after this list)
Reviewing the Source DAS cluster data - /user/test1
2. Execute a replication and review the results; only the new data was copied, as expected
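The gen2/sort2/validate2 data sets referenced in step 1 can be produced with the same example jar used earlier (a hedged sketch; the tpcds data set came from a separate tool and is not shown):
$ su - test1
$ cd /opt/cloudera/parcels/CDH/jars
$ yarn jar hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar teragen 1000000 /user/test1/gen2
$ yarn jar hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar terasort /user/test1/gen2 /user/test1/sort2
$ yarn jar hadoop-mapreduce-examples-2.6.0-cdh5.11.1.jar teravalidate /user/test1/sort2 /user/test1/validate2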
As can be seen, using HDFS replication is straightforward and can be used to maintain a well-structured, scheduled backup methodology for large HDFS data sets. Now that the data is resident on Isilon, additional protection methodologies can be leveraged: SyncIQ copies to other Isilon clusters, Isilon snapshots, NDMP backups, and tiering.
In the next post we will look at how Hive/Impala replication is enabled for integration between two Cloudera clusters --> Isilon and Cloudera Backup and Disaster Recovery Integration - Hive Metastore and Data Replication
Article ID: SLN319503
Last Date Modified: 01/31/2020 01:48 PM