This topic is part of the Using Hadoop with OneFS - Isilon Info Hub.
In Isilon and Cloudera Backup and Disaster Recovery Integration, we reviewed Cloudera BDR integration for HDFS replication between a DAS cluster and an Isilon cluster. In this post we close the loop on BDR replication and review how to set up and integrate Hive replication.
The following prerequisites apply:
- CDH 5.8 and greater
- UID/GID parity - through local accounts or LDAP, parity in uid and gid is important to maintain consistent access across storage
- DAS Cloudera cluster setup complete
- Isilon Cloudera cluster setup complete
- DNS name resolution fully functional for all hosts, forward and reverse
- Both the source and destination clusters must have a Cloudera Enterprise license
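The UID/GID parity prerequisite can be spot-checked before any replication is scheduled. The following is a minimal sketch, assuming you can export the relevant accounts from each side as passwd-format text (for example, the output of `getent passwd` on a Linux host). The function names and the sample accounts are hypothetical and only illustrate the comparison.

```python
from typing import Dict, List, Tuple


def parse_passwd(text: str) -> Dict[str, Tuple[int, int]]:
    """Map user name -> (uid, gid) from passwd-format lines (name:x:uid:gid:...)."""
    accounts: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        accounts[fields[0]] = (int(fields[2]), int(fields[3]))
    return accounts


def parity_mismatches(das_passwd: str, isilon_passwd: str, users: List[str]) -> List[str]:
    """Return one message per user whose uid/gid differs or who is missing on a side."""
    das = parse_passwd(das_passwd)
    isilon = parse_passwd(isilon_passwd)
    problems = []
    for user in users:
        if user not in das or user not in isilon:
            problems.append(f"{user}: missing on one side")
        elif das[user] != isilon[user]:
            problems.append(f"{user}: DAS {das[user]} != Isilon {isilon[user]}")
    return problems


if __name__ == "__main__":
    # Sample data only; real input would be exported from each environment.
    das_sample = "hdfs:x:492:489::/var/lib/hadoop-hdfs:/bin/bash\nhive:x:487:485::/var/lib/hive:/bin/false"
    isilon_sample = "hdfs:x:492:489::/var/lib/hadoop-hdfs:/bin/bash\nhive:x:501:485::/var/lib/hive:/bin/false"
    for problem in parity_mismatches(das_sample, isilon_sample, ["hdfs", "hive", "impala"]):
        print(problem)
```

An empty result means the listed accounts line up; any message points at an account to fix before replicating.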
Note the following when using replication jobs for clusters with Isilon:
- The hdfs user is mapped to root on Isilon. If you specify alternate users with the Run As option when creating replication schedules, those users must also be superusers.
- Always select the 'Skip Checksum Checks' property when creating replication schedules.
- Kerberos authentication is fully supported from CDH 5.8 onward. The account used to replicate data needs a principal and keytab to authenticate against the target; see the Cloudera documentation for additional information on configuring this.
- Data replication can fail if the source data is modified during replication, so it is recommended to use snapshots as the source of data replication. If snapshots are enabled, replication automatically makes use of them to prevent this issue. For more details, see the Cloudera documentation: Using Snapshots with Replication.
- Source clusters that use Isilon storage do not support HDFS snapshots, which are used to ensure data consistency when source files are modified during replication. Therefore, when replicating from an Isilon source cluster, it is recommended that you do not replicate Hive tables or HDFS files that could be modified before the replication completes, unless you take additional steps to ensure the replication succeeds. Additional options are to use SyncIQ to replicate data between Isilon clusters, or to use Isilon native snapshots in conjunction with metastore replication.
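The notes above can be collected into a small pre-flight checklist. This is only a sketch: `ReplicationPlan` is a hypothetical structure invented here for illustration, not a Cloudera Manager type, and the checks simply restate the notes in code.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ReplicationPlan:
    """Hypothetical description of a schedule; not a Cloudera Manager type."""
    source_is_isilon: bool
    target_is_isilon: bool
    run_as_user: str = "hdfs"
    skip_checksum_checks: bool = False
    kerberos_enabled: bool = False
    has_keytab: bool = False
    superusers: Set[str] = field(default_factory=lambda: {"hdfs"})


def check_plan(plan: ReplicationPlan) -> List[str]:
    """Return a warning for every note from the list above that the plan violates."""
    warnings = []
    involves_isilon = plan.source_is_isilon or plan.target_is_isilon
    if involves_isilon and not plan.skip_checksum_checks:
        warnings.append("Select 'Skip Checksum Checks' when Isilon is involved.")
    if involves_isilon and plan.run_as_user not in plan.superusers:
        warnings.append(f"Run As user '{plan.run_as_user}' must be a superuser.")
    if plan.kerberos_enabled and not plan.has_keytab:
        warnings.append("Kerberos is enabled: the replication account needs a principal and keytab.")
    if plan.source_is_isilon:
        warnings.append("Isilon sources do not support HDFS snapshots; avoid replicating data that may change mid-run.")
    return warnings


if __name__ == "__main__":
    # DAS -> Isilon, as in this post, with the checksum option forgotten.
    for warning in check_plan(ReplicationPlan(source_is_isilon=False, target_is_isilon=True)):
        print(warning)
```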
In our example we have loaded a sample set of data for use by Impala on our DAS Cloudera cluster. Since Impala shares the Hive metastore database, we can use BDR Hive replication to replicate this Impala database and its HDFS data to our Isilon Cloudera cluster. This illustrates that both Hive- and Impala-based databases, along with their HDFS-based tables, can be replicated with BDR.
1. In Hue, we see the tpcds_parquet database under Impala/Hive.
2. The tpcds_parquet table definitions and information can be seen here in Hue.
3. The data for the tables is seen here in /user/hive/warehouse.
4. Run a sample Impala query to validate the data on the DAS cluster.
5. On the Isilon cluster, the tpcds_parquet database, tables and HDFS data do not exist.
6. Since we have already created a replication peer in the first blog post, we can move straight on to setting up Hive/Impala replication using the Cloudera Backup tools.
7. Select the DAS cluster as the source. The replication schedule and the databases to replicate can be defined here, along with the Run As Username; any user specified needs superuser permissions, and Kerberos credentials if the clusters use Kerberos.
Again, make sure to always check "Skip Checksum Checks" as the target is Isilon.
You also have the option to override the location of the exported metadata and the location the HDFS data is replicated to. For more details, see: Hive/Impala Replication.
If the source HDFS data is not enabled for snapshots, you'll see the following information. It is highly recommended to use snapshots with Hive/Impala replication. To configure this, make the default location of the source HDFS data, /user/hive/warehouse, snapshottable. BDR will then automatically make use of this feature when replicating data.
We have enabled snapshots on the default location for data: /user/hive/warehouse
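One way to make a directory snapshottable from the command line is `hdfs dfsadmin -allowSnapshot`, run as an HDFS superuser on the source (DAS) cluster. The sketch below only builds that command and, when asked, runs it; the helper names are hypothetical, and it assumes the `hdfs` client is on the PATH of the host where it runs.

```python
import subprocess
from typing import List


def allow_snapshot_command(path: str) -> List[str]:
    """Build the argv for 'hdfs dfsadmin -allowSnapshot <path>'."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return ["hdfs", "dfsadmin", "-allowSnapshot", path]


def enable_snapshots(path: str, dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    command = allow_snapshot_command(path)
    if not dry_run:
        # Raises CalledProcessError if the hdfs client reports a failure.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Dry run: show the command for the default Hive warehouse location.
    print(" ".join(enable_snapshots("/user/hive/warehouse")))
```

Keeping the dry run as the default makes it easy to review the command before it touches the cluster.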
8. Having defined the schedule, execute it
9. The replication then executes, copying the metadata and data: we see it copy the database, tables and HDFS data.
10. We can now see the tpcds_parquet database in the metastore. The BDR job takes care of the location-specific URIs and paths that change because the metadata and data now reside on a different Hadoop cluster. This is the critical piece of Hive/Impala replication and why using BDR is so useful.
11. Running a simple SQL query against the customer table on both clusters validates the database, table and HDFS data replication was successful.
On DAS cluster
On Isilon Cluster
Hive Replication and Incremental Replication
1. Drop a Hive/Impala Table on the Isilon cluster
2. Execute replication. Incremental replication copies only the data for the missing table.
3. Update Hue's view of the metastore data
The table is now present and can be queried
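The outcome of the steps above can be pictured as a set difference between the tables on the source and those on the target. This sketch only illustrates that observed result; it is not a description of how BDR decides what to copy internally, and the table names are sample values.

```python
from typing import Iterable, List


def tables_missing_on_target(source_tables: Iterable[str], target_tables: Iterable[str]) -> List[str]:
    """Return, sorted, the tables that exist on the source but not on the target."""
    present = set(target_tables)
    return sorted(t for t in set(source_tables) if t not in present)


if __name__ == "__main__":
    # After dropping one table on the target, only that table needs to be copied.
    source = ["customer", "item", "store_sales"]
    target = ["item", "store_sales"]
    print(tables_missing_on_target(source, target))
```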
Having now replicated the Hive/Impala metastore data and the underlying table data on HDFS to the Isilon cluster, we can again leverage exciting native Isilon features to protect this data further: snapshots, SyncIQ, NDMP backup, etc. This short demonstration illustrates how Cloudera BDR can be used to back up and replicate HDFS data between Hadoop DAS clusters and Isilon-integrated Hadoop clusters easily.