Note: This topic is part of the Using Hadoop with OneFS - PowerScale Info Hub.
This article describes the recommended best practices for the backup of non-PowerScale Hadoop environments to a Dell EMC PowerScale cluster. With its very robust erasure-coding data protection that provides greater than 80% storage efficiency, Dell EMC PowerScale is an ideal backup target for data located on a Hadoop cluster. DistCp (distributed copy) is a standard tool that comes with all Hadoop distributions and versions. DistCp can copy entire Hadoop directories. DistCp runs as a MapReduce job to perform file copies in parallel, fully utilizing your systems if desired. There is also an option to limit the bandwidth to control the impact on other tasks.
This article uses the following test environment:
Because DistCp is a standard Hadoop tool, the approach outlined in this document is applicable to most, if not all, other Hadoop distributions and versions.
While reading this document, assume that the data to back up is located on the PHD Hadoop HDFS cluster in the directory /mydata. The examples back up this data to the PowerScale cluster in the directory /ifs/hadoop/backup/mydata.
THE SIMPLEST BACKUP METHOD
The simplest backup command is shown below:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update /mydata hdfs://all-nc-s-hdfs/backup/mydata
You can run the above command on any host with the Hadoop client (hadoop) installed. The user running the command must have permissions to read the source files and write the target files.
The options -skipcrccheck and -update must be specified to avoid the CRC check on the target files that will be placed on the PowerScale cluster. PowerScale does not store the Hadoop CRC, and calculating it would be too expensive. Therefore, these options are required to prevent errors related to the CRC check.
The next parameter "/mydata" is the source path on the source Hadoop cluster. This could also be "/" to back up your entire HDFS namespace. Note that since the path is not fully-qualified, it uses the HDFS NameNode specified in the fs.defaultFS parameter of core-site.xml.
The final parameter "hdfs://all-nc-s-hdfs/backup/mydata" is the target path on your PowerScale cluster. The host portion "all-nc-s-hdfs" can be a relative or fully-qualified DNS name such as all-nc-s-hdfs.example.com. It should be the SmartConnect Zone DNS name for your PowerScale cluster. The directory portion "/backup/mydata" is relative to the HDFS root path defined in your PowerScale cluster access zone. If your HDFS root path is /ifs/hadoop, then this value refers to /ifs/hadoop/backup/mydata.
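The mapping from a target URI to the directory it occupies on the PowerScale cluster can be sketched as a small shell function. This is an illustration only: onefs_path is a hypothetical helper name, and /ifs/hadoop is the HDFS root path assumed in this article's example.

```shell
# Map a DistCp target URI to the OneFS path it lands on.
# hdfs_root is the HDFS root path of the access zone (assumed /ifs/hadoop here).
onefs_path() (
    hdfs_root=$1
    uri=$2
    # Strip the scheme and host ("hdfs://host/"), keeping the directory portion.
    path=${uri#hdfs://*/}
    printf '%s/%s\n' "${hdfs_root%/}" "$path"
)

onefs_path /ifs/hadoop hdfs://all-nc-s-hdfs/backup/mydata
```

With the example values, the function prints /ifs/hadoop/backup/mydata, which matches the directory described above.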
Files whose sizes are identical on the source and target directories are assumed to be unchanged and are not copied. In particular, file timestamps are not used to determine changed files. For more details on DistCp, refer to http://hadoop.apache.org/docs/r1.2.1/distcp2.html.
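Because the CRC check is skipped, a coarse sanity check after the copy may be worthwhile. The sketch below compares the file count and total size reported by hdfs dfs -count, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. The helper names are hypothetical, the two sample lines are made-up values for illustration, and it is an assumption that your PowerScale cluster reports matching totals for your OneFS version.

```shell
# Print "FILE_COUNT CONTENT_SIZE" from one line of `hdfs dfs -count` output,
# whose columns are DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
count_summary() {
    echo "$1" | awk '{ print $2, $3 }'
}

# Succeed only if both lines report the same file count and total size.
counts_match() {
    [ "$(count_summary "$1")" = "$(count_summary "$2")" ]
}

# On a host with the Hadoop client, the inputs would come from, for example:
#   src=$(hdfs dfs -count /mydata)
#   dst=$(hdfs dfs -count hdfs://all-nc-s-hdfs/backup/mydata)
# The two lines below are made-up sample values, for illustration only.
src='          12          340         1048576000 /mydata'
dst='          12          340         1048576000 hdfs://all-nc-s-hdfs/backup/mydata'

if counts_match "$src" "$dst"; then
    echo "file count and total size match"
else
    echo "MISMATCH: investigate before relying on this backup"
fi
```

Matching totals do not prove that file contents are identical. They only catch missing files or files whose sizes differ.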
By default, the owner, group, and permissions of the target files are reset to the default for new files created by the user initiating DistCp. Any owner, group, and permissions defined for the source file are lost. To retain this information from the source files, use the -p option. Because the -p option must perform chown/chgrp, the user initiating DistCp must be a super-user on the target system. The root user on the PowerScale cluster works for this purpose. For example:
[root@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update -pugp /mydata hdfs://all-nc-s-hdfs/backup/mydata
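The introduction mentions that the impact of a backup on other tasks can be limited. A minimal sketch of how this might look uses the DistCp options -m (maximum number of simultaneous copies) and -bandwidth (approximate MB/second per map). The function name and the values 10 and 50 are illustrative assumptions, not recommendations. The function only prints the command so that you can review it before running it.

```shell
# Build a throttled DistCp command line.
# max_maps and mb_per_map are illustrative values, not recommendations.
throttled_distcp_cmd() (
    max_maps=$1      # -m: maximum number of simultaneous copy tasks
    mb_per_map=$2    # -bandwidth: approximate MB/second allowed per map
    echo "hadoop distcp -skipcrccheck -update -m $max_maps -bandwidth $mb_per_map" \
         "/mydata hdfs://all-nc-s-hdfs/backup/mydata"
)

# Print the command for 10 maps at about 50 MB/s each (roughly 500 MB/s in aggregate).
throttled_distcp_cmd 10 50
```

Lower values reduce the load on the source cluster and the network at the cost of a longer backup window.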
USING SNAPSHOTS FOR YOUR BACKUP SOURCE
The backup of large datasets may take a very long time. A file that exists at the beginning of the DistCp process, when the directory structure is scanned, may no longer exist when that file is actually copied, which produces errors. Further, an application may require a consistent, single point-in-time backup for it to be usable. To deal with these issues, it is recommended that you create an HDFS snapshot of your source to ensure that the dataset does not change during the backup process. Note that this is unrelated to the SnapshotIQ feature of your target PowerScale cluster.
To use HDFS snapshots, you must first allow snapshots for a particular directory and then create the snapshot:
[gpadmin@phddas2-master-0 ~]$ hdfs dfsadmin -allowSnapshot /mydata
Allowing snapshot on /mydata succeeded
[gpadmin@phddas2-master-0 ~]$ hdfs dfs -createSnapshot /mydata backupsnap
Created snapshot /mydata/.snapshot/backupsnap
The name of this snapshot is backupsnap. You can access it at the HDFS path /mydata/.snapshot/backupsnap. Any changes to your HDFS files after this snapshot are not reflected in the subsequent backup. You can back up the snapshot to PowerScale, and then delete the snapshot once the copy completes, using the following commands:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update /mydata/.snapshot/backupsnap hdfs://all-nc-s-hdfs/backup/mydata
[gpadmin@phddas2-master-0 ~]$ hdfs dfs -deleteSnapshot /mydata backupsnap
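The snapshot, copy, and cleanup steps above can be combined into one script so that the snapshot is removed even when the copy fails. The following is a sketch under stated assumptions: run and backup_from_snapshot are hypothetical helper names, and the DRY_RUN variable is an invention of this example. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_from_snapshot() (
    src=$1; target=$2; snap=$3
    # Stop early if the snapshot cannot be created.
    run hdfs dfs -createSnapshot "$src" "$snap" || exit 1
    run hadoop distcp -skipcrccheck -update "$src/.snapshot/$snap" "$target"
    status=$?
    # Delete the snapshot whether or not the copy succeeded.
    run hdfs dfs -deleteSnapshot "$src" "$snap"
    # Report the result of the copy, not of the cleanup.
    exit "$status"
)

backup_from_snapshot /mydata hdfs://all-nc-s-hdfs/backup/mydata backupsnap
```

The function returns the exit status of the DistCp step, so a scheduler or wrapper script can detect a failed backup.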
To create snapshots on PowerScale, you must have a SnapshotIQ license. You can create snapshots using the web admin interface or CLI. To create a single PowerScale snapshot manually with the CLI, SSH into any PowerScale node and run the following:
all-nc-s-1# isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-2014-07-01 --expires 1D --verbose
Created snapshot backup-2014-07-01 with ID 6
You can add this command to the backup process discussed in the Scheduling Backups section below.
For more details regarding PowerScale OneFS snapshots, refer to the PowerScale OneFS CLI Administration Guide for your version of OneFS.
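When snapshots are taken repeatedly, it may help to derive the snapshot name from the date rather than typing it. The sketch below prints the isi command shown above with a dated name. The function name snapshot_cmd is hypothetical, and the command is only printed, not executed, so it can be reviewed before being run on a PowerScale node.

```shell
# Print the isi command for a snapshot named after a given date (YYYY-MM-DD).
snapshot_cmd() (
    day=${1:-$(date +%Y-%m-%d)}   # default: today's date
    echo "isi snapshot snapshots create /ifs/hadoop/backup/mydata" \
         "--name backup-$day --expires 1D --verbose"
)

# Reproduce the example from this article.
snapshot_cmd 2014-07-01
```

Calling snapshot_cmd without an argument uses the current date of the host on which it runs.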
SYNCIQ REPLICATION FOR MULTIPLE PowerScale CLUSTERS
After the DistCp backup to the PowerScale cluster completes, you can use OneFS SyncIQ to replicate snapshots across a WAN to other PowerScale clusters. Replicated snapshots can provide a very versatile and efficient component of your disaster recovery strategy.
HANDLING DELETED FILES
By default, files deleted from the source Hadoop cluster are not deleted from the target PowerScale cluster. If you require this behavior, add the -delete argument to the DistCp command. When using this option, it is recommended to use snapshots on the backup target to allow for the recovery of deleted files.
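As an illustration, a backup command that also removes target files that no longer exist on the source might look like the one printed by the sketch below. The function name mirror_cmd is hypothetical. DistCp honors -delete only in combination with -update or -overwrite, which is why -update is kept.

```shell
# Print a DistCp command that also removes target files no longer on the source.
# -delete is only honored together with -update or -overwrite.
mirror_cmd() (
    echo "hadoop distcp -skipcrccheck -update -delete $1 $2"
)

mirror_cmd /mydata hdfs://all-nc-s-hdfs/backup/mydata
```

Because this option removes data from the backup target, consider reviewing the printed command and taking a snapshot of the target before running it.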
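To restore data, reverse the source and target paths so that DistCp copies from the PowerScale cluster back to the Hadoop cluster:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update hdfs://all-nc-s-hdfs/backup/mydata /mydata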
You may want to create a snapshot of the target directory so that you can undo any mistakes made during the recovery process. However, be aware of the additional disk usage needed to maintain snapshots.
DIRECT ACCESS TO BACKUP DATA USING HDFS
The backup target files on PowerScale are accessible from Hadoop applications in the same way as the source files, due to PowerScale’s support for HDFS. You can use your backup data directly, without having to first restore it to your original source Hadoop environment. This capability saves analysis time. For example, if you normally run a MapReduce command like this:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep /mydata/mydataset1 output1 ABC
You can run the MapReduce job against the backup dataset on PowerScale using the following command:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep hdfs://all-nc-s-hdfs/backup/mydata/mydataset1 output1 ABC
For details on how to specify a fully-qualified Hadoop path instead of relying on the fs.defaultFS parameter, check with your application provider. Also, a PowerScale cluster that is designed for backup and archive, rather than for high performance, is unlikely to provide the same performance as your primary Hadoop environment. Testing is recommended, or consult with Dell EMC PowerScale for proper sizing.
RECOVERY FROM PowerScale SNAPSHOTS
If PowerScale does not support your Hadoop version, you can still use DistCp to back up and restore your Hadoop data with PowerScale using HFTP. For instance, PHD 2.0 and later are not supported on PowerScale OneFS 7.1.1 and earlier. In this configuration, you need to build a small Hadoop cluster using a version of Hadoop that is directly supported by PowerScale, then run DistCp on this new cluster using the HFTP protocol to access your source data on your original Hadoop cluster. HFTP is a read-only file system that is compatible across different versions of Hadoop. For example:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update hftp://phddas2-namenode-0/mydata hdfs://all-nc-s-hdfs/backup/mydata
The size of the new small cluster that will run the DistCp MapReduce job primarily depends on how much throughput is required. If you only need to back up at the rate of 10 Gbps, then you only need a single Hadoop node. None of your data will be stored on this small Hadoop cluster, so disk requirements are minimal.
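To judge whether a given rate is sufficient, it can help to estimate how long a dataset takes to copy. The sketch below is simple arithmetic, not a sizing tool: backup_seconds is a hypothetical helper, the 100 TB dataset size is an assumed example, and the calculation assumes decimal units and a link that sustains its full rate for the whole transfer.

```shell
# Estimate seconds needed to copy a dataset at a sustained network rate.
# Uses decimal units: 1 TB = 8 * 10^12 bits, 1 Gbps = 10^9 bits/second,
# so each TB takes 8000 seconds at 1 Gbps.
backup_seconds() {
    echo $(( $1 * 8000 / $2 ))   # $1 = dataset size in TB, $2 = rate in Gbps
}

secs=$(backup_seconds 100 10)
echo "100 TB at 10 Gbps: $secs seconds (about $(( secs / 3600 )) hours)"
```

Under these assumptions, 100 TB at 10 Gbps takes 80000 seconds, or about 22 hours. Real throughput is usually lower, so treat the result as a lower bound on the backup window.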
Article ID: SLN319148
Last Date Modified: 07/08/2020 06:08 PM