Backing Up Hadoop To PowerScale

Note: This topic is part of the Using Hadoop with OneFS - PowerScale Info Hub.


Best Practices for using DistCp to Back Up Hadoop

This article describes the recommended best practices for the backup of non-PowerScale Hadoop environments to a Dell EMC PowerScale cluster. With its very robust erasure-coding data protection that provides greater than 80% storage efficiency, Dell EMC PowerScale is an ideal backup target for data located on a Hadoop cluster. DistCp (distributed copy) is a standard tool that comes with all Hadoop distributions and versions. DistCp can copy entire Hadoop directories. DistCp runs as a MapReduce job to perform file copies in parallel, fully utilizing your systems if desired. There is also an option to limit the bandwidth to control the impact on other tasks.

ENVIRONMENT

This article uses the following test environment:

  • Pivotal HD (PHD) 2.0.1, installed using Pivotal Control Center 2.0. All settings use default values. In particular, HDFS is installed on the PHD nodes for a traditional DAS configuration.
  • PowerScale OneFS 7.2.0

Because DistCp is a standard Hadoop tool, the approach outlined in this document is applicable to most, if not all, other Hadoop distributions and versions.

While reading this document, assume that the data to back up is located on the PHD Hadoop HDFS cluster in the directory /mydata. The examples back up this data to the PowerScale cluster in the directory /ifs/hadoop/backup/mydata.



BACKUP METHODS

THE SIMPLEST BACKUP METHOD
The simplest backup command is shown below:

[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update /mydata hdfs://all-nc-s-hdfs/backup/mydata

You can run the above command on any host with the Hadoop client (hadoop) installed. The user running the command must have permissions to read the source files and write the target files.

The -skipcrccheck and -update options must be specified to skip the CRC check on the target files placed on the PowerScale cluster. OneFS does not store the Hadoop CRC, and calculating it would be too expensive, so these options are required to prevent CRC-related errors.

The next parameter, "/mydata", is the source path on the source Hadoop cluster. This could also be "/" to back up your entire HDFS namespace. Because the path is not fully qualified, it uses the HDFS NameNode specified in the fs.defaultFS parameter of core-site.xml.

The final parameter "hdfs://all-nc-s-hdfs/backup/mydata" is the target path on your PowerScale cluster. The host portion "all-nc-s-hdfs" can be a relative or fully-qualified DNS name such as all-nc-s-hdfs.example.com. It should be the SmartConnect Zone DNS name for your PowerScale cluster. The directory portion "/backup/mydata" is relative to the HDFS root path defined in your PowerScale cluster access zone. If your HDFS root path is /ifs/hadoop, then this value refers to /ifs/hadoop/backup/mydata.

Files whose sizes are identical in the source and target directories are assumed to be unchanged and are not copied; in particular, file timestamps are not used to detect changed files. For more details on DistCp, refer to http://hadoop.apache.org/docs/r1.2.1/distcp2.html.
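As mentioned earlier, DistCp can also throttle its copy rate. The -m option caps the number of simultaneous map tasks, and the -bandwidth option limits the throughput of each map in MB per second. The values below are illustrative only; adapt them to your environment:

[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update -m 10 -bandwidth 50 /mydata hdfs://all-nc-s-hdfs/backup/mydata

With these settings, the job runs at most 10 concurrent copy tasks at roughly 50 MB/s each, limiting the total load to about 500 MB/s.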

COPYING PERMISSIONS
By default, the owner, group, and permissions of the target files are reset to the defaults for new files created by the user initiating DistCp, and any owner, group, and permissions defined on the source files are lost. To retain this information from the source files, use the -p option with the attributes to preserve; for example, -pugp preserves user, group, and permissions. Because the -p option must perform chown/chgrp, the user initiating DistCp must be a super-user on the target system. The root user on the PowerScale cluster works for this purpose. For example:

[root@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update -pugp /mydata hdfs://all-nc-s-hdfs/backup/mydata

USING SNAPSHOTS FOR YOUR BACKUP SOURCE
The backup of large datasets may take a very long time. Files that exist when the DistCp process scans the directory structure may no longer exist when they are actually copied, producing errors. Further, an application may require a consistent, single point-in-time backup for the data to be usable. To deal with these issues, it is recommended that you create an HDFS snapshot of your source so that the dataset does not change during the backup process. Note that this is unrelated to the SnapshotIQ feature of your target PowerScale cluster.

To use HDFS snapshots, you must first allow snapshots for a particular directory:

[gpadmin@phddas2-master-0 ~]$ hdfs dfsadmin -allowSnapshot /mydata
Allowing snapshot on /mydata succeeded

Immediately before a backup with DistCp, create the HDFS snapshot:

[gpadmin@phddas2-master-0 ~]$ hdfs dfs -createSnapshot /mydata backupsnap
Created snapshot /mydata/.snapshot/backupsnap

The name of this snapshot is backupsnap. You can access it at the HDFS path /mydata/.snapshot/backupsnap. Any changes to your HDFS files after this snapshot are not reflected in the subsequent backup. You can back up the snapshot to PowerScale using the following command:

[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update /mydata/.snapshot/backupsnap hdfs://all-nc-s-hdfs/backup/mydata

When the backup command completes, you can delete the snapshot. Doing so frees up any space used to hold older versions of files that were modified since the snapshot:

[gpadmin@phddas2-master-0 ~]$ hdfs dfs -deleteSnapshot /mydata backupsnap

USING PowerScale SNAPSHOTS FOR YOUR BACKUP TARGET
Independent of whether you use snapshots for your backup source, you may want to keep multiple snapshots of your backup target directory so that you can restore older versions of files.

To create snapshots on PowerScale, you must have a SnapshotIQ license. You can create snapshots using the web admin interface or CLI. To create a single PowerScale snapshot manually with the CLI, SSH into any PowerScale node and run the following:

all-nc-s-1# isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-2014-07-01 --expires 1D --verbose
Created snapshot backup-2014-07-01 with ID 6

You can add this command to the backup process discussed in the Scheduling Backups section below.

For more details regarding PowerScale OneFS snapshots, refer to the PowerScale OneFS CLI Administration Guide for your version of OneFS.

SYNCIQ REPLICATION FOR MULTIPLE PowerScale CLUSTERS
After the DistCp backup to the PowerScale cluster completes, you can use OneFS SyncIQ to replicate snapshots across a WAN to other PowerScale clusters. Replicated snapshots can provide a very versatile and efficient component of your disaster recovery strategy.
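As a rough sketch of what this might look like from the OneFS CLI (the exact isi sync syntax and available options vary by OneFS version, and the policy name, schedule, and target host dr-cluster.example.com are assumptions for illustration):

all-nc-s-1# isi sync policies create mydata-dr sync /ifs/hadoop/backup/mydata dr-cluster.example.com /ifs/hadoop/backup/mydata --schedule "every day at 01:00"

This would replicate the backup directory to a second PowerScale cluster every night. Refer to the PowerScale OneFS CLI Administration Guide for the authoritative syntax for your OneFS version.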




HANDLING DELETED FILES
By default, files deleted from the source Hadoop cluster are not deleted from the backup target. If you require this behavior, add the -delete argument to the DistCp command, as shown below. When using this option, it is recommended that you keep snapshots on the backup target to allow recovery of deleted files.
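For example, to mirror deletions from the source into the backup target:

[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update -delete /mydata hdfs://all-nc-s-hdfs/backup/mydata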


SCHEDULING BACKUPS
You can automate and schedule the steps to back up a Hadoop cluster using a variety of methods. Apache Oozie is often used to automate Hadoop tasks, and it directly supports DistCp. Cron can also be used to run a suitable shell script, as shown in the sketch below. To automate commands that run in an SSH session, enable password-less SSH so that the cron user can connect to your Hadoop client and, if using SnapshotIQ, to your PowerScale cluster.
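As a minimal sketch of such a script, combining the commands shown in the earlier sections (the snapshot naming scheme, script path, log path, and SSH target are assumptions; adapt them to your environment):

#!/bin/bash
# backup-mydata.sh - nightly DistCp backup of /mydata to PowerScale
set -e
SNAP="backupsnap-$(date +%Y-%m-%d)"

# Take a point-in-time HDFS snapshot of the source directory
hdfs dfs -createSnapshot /mydata "$SNAP"

# Copy the snapshot contents to the PowerScale cluster
hadoop distcp -skipcrccheck -update /mydata/.snapshot/"$SNAP" hdfs://all-nc-s-hdfs/backup/mydata

# Remove the source snapshot once the copy has completed
hdfs dfs -deleteSnapshot /mydata "$SNAP"

# Optionally snapshot the backup target on PowerScale (requires SnapshotIQ
# and password-less SSH to the cluster)
ssh root@all-nc-s-hdfs "isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-$(date +%Y-%m-%d) --expires 30D"

A cron entry such as the following would run the script nightly at 1 AM:

0 1 * * * /home/gpadmin/backup-mydata.sh >> /var/log/backup-mydata.log 2>&1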


RECOVERY METHODS

REVERSE DISTCP
The standard method to restore a DistCp backup from PowerScale to a traditional Hadoop infrastructure is to run DistCp in the reverse direction by swapping the source and target paths:

[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update hdfs://all-nc-s-hdfs/backup/mydata /mydata

You may want to create a snapshot of the target directory before restoring, so that you can undo any mistakes made during the recovery process; an example follows below. However, be aware of the additional disk usage needed to maintain snapshots.
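For example, using the HDFS snapshot commands shown earlier (the snapshot name prerestore is illustrative):

[gpadmin@phddas2-master-0 ~]$ hdfs dfsadmin -allowSnapshot /mydata
[gpadmin@phddas2-master-0 ~]$ hdfs dfs -createSnapshot /mydata prerestore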

DIRECT ACCESS TO BACKUP DATA USING HDFS

The backup target files on PowerScale are accessible from Hadoop applications in the same way as the source files, due to PowerScale’s support for HDFS. You can use your backup data directly, without having to first restore it to your original source Hadoop environment. This capability saves analysis time. For example, if you normally run a MapReduce command like this:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep /mydata/mydataset1 output1 ABC

You can run the MapReduce job against the backup dataset on PowerScale using the following command:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep hdfs://all-nc-s-hdfs/backup/mydata/mydataset1 output1 ABC

Note that this requires your application to accept a fully qualified Hadoop path instead of relying on the fs.defaultFS parameter; check with your application provider for details. Also, a PowerScale cluster designed for backup and archive, rather than for high performance, is unlikely to provide the same performance as your primary Hadoop environment. Testing is recommended, or consult with Dell EMC PowerScale for proper sizing.

RECOVERY FROM PowerScale SNAPSHOTS
You can recover files from a previous PowerScale snapshot. The files are available in the /ifs/.snapshot directory. For details and other options, refer to the PowerScale OneFS CLI Administration Guide.
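For example, snapshot contents appear under /ifs/.snapshot/<snapshot-name>/ with the same layout as the live file system, so a file can be copied back out directly on the cluster. This sketch uses the snapshot created earlier; the file name is a hypothetical example:

all-nc-s-1# cp -a /ifs/.snapshot/backup-2014-07-01/hadoop/backup/mydata/file1.csv /ifs/hadoop/backup/mydata/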

HDFS VERSION COMPATIBILITY
PowerScale is compatible with multiple versions of HDFS, and different versions can be used simultaneously to access the same dataset. PowerScale automatically detects the appropriate HDFS version per connection without any configuration. Refer to the PowerScale OneFS CLI Administration Guide for the list of supported Hadoop distributions and versions, or visit Hadoop Distributions and Products Supported by OneFS. This version compatibility means that multiple Hadoop environments running different versions of Hadoop can easily back up to a single PowerScale cluster using HDFS.

In the case where PowerScale does not support your Hadoop version, you can still use DistCp to back up and restore your Hadoop data with PowerScale using HFTP. For instance, PHD 2.0 and later is not supported on PowerScale OneFS 7.1.1 and earlier. In this configuration, build a small Hadoop cluster using a version of Hadoop that is directly supported by PowerScale, then run DistCp on this new cluster, using the HFTP protocol to access your source data on your original Hadoop cluster. HFTP is a read-only file system that is compatible across different versions of Hadoop. For example:

[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update hftp://phddas2-namenode-0/mydata hdfs://all-nc-s-hdfs/backup/mydata

The size of the new small cluster that will run the DistCp MapReduce job primarily depends on how much throughput is required. If you only need to back up at the rate of 10 Gbps, then you only need a single Hadoop node. None of your data will be stored on this small Hadoop cluster so disk requirements are minimal.



CONCLUSION
Dell EMC PowerScale is a great platform for Hadoop and other Big Data applications. It uses erasure coding to protect data with greater than 80% storage efficiency, in contrast to traditional HDFS with its 33% storage efficiency from three-way replication. Dell EMC PowerScale offers several classes of node types, from the very dense NL400 to the high-performance S210, with the X410 in between, allowing you to optimize different PowerScale tiers for particular workloads. Backing up traditional Hadoop environments to PowerScale is easy and provides the densest usable HDFS backup target.


Article ID: SLN319148

Last Date Modified: 07/08/2020 06:08 PM
