Article Number: 000126807

Backing Up Hadoop To PowerScale

Summary: This article describes the recommended best practices for the backup of non-PowerScale Hadoop environments to a Dell PowerScale (formerly Isilon) cluster.

Article Content

Instructions

NOTE: This topic is part of the Using Hadoop with OneFS Info Hub.

Best Practices for using DistCp to Back Up Hadoop

This article describes the recommended best practices for the backup of non-PowerScale Hadoop environments to a Dell PowerScale cluster. With its robust erasure-coding data protection that provides greater than 80% storage efficiency, Dell PowerScale is an ideal backup target for data located on a Hadoop cluster. DistCp (distributed copy) is a standard tool that comes with all Hadoop distributions and versions. DistCp can copy entire Hadoop directories. DistCp runs as a MapReduce job to perform file copies in parallel, fully using your systems if required. There is also an option to limit the bandwidth to control the impact on other tasks.

ENVIRONMENT
This article uses the following test environment:

Pivotal HD (PHD) 2.0.1, installed using Pivotal Control Center 2.0, all settings use default values. In particular, HDFS is installed on the PHD nodes for a traditional DAS configuration.
PowerScale OneFS 7.2.0

Because DistCp is a standard Hadoop tool, the approach outlined in this document applies to most, if not all, other Hadoop distributions and versions.

While reading this document, assume that the data to back up is located on the PHD Hadoop HDFS cluster in the directory /mydata. The examples back up this data to the PowerScale cluster in the directory /ifs/hadoop/backup/mydata.

Figure 1: Backup a Hadoop Cluster to Isilon

BACKUP METHODS:

THE SIMPLEST BACKUP METHOD

The simplest backup command is shown below:

[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update /mydata hdfs://all-nc-s-hdfs/backup/mydata

You can run the above command on any host which has the Hadoop client (hadoop) installed. The user running the command must have permissions to read the source files and write the target files.

The -skipcrccheck and -update options must both be specified so that DistCp skips the CRC check on the target files that are placed on the PowerScale cluster. PowerScale does not store the Hadoop CRC, and calculating it would be too expensive, so omitting these options produces CRC-related errors.

The next parameter "/mydata" is the source path on the source Hadoop cluster. This could also be "/" to back up your entire HDFS namespace. Since the path is not fully qualified, it uses the HDFS NameNode specified in the fs.defaultFS parameter of core-site.xml.

The final parameter "hdfs://all-nc-s-hdfs/backup/mydata" is the target path on your PowerScale cluster. The host portion "all-nc-s-hdfs" can be a relative or fully qualified DNS name such as all-nc-s-hdfs.example.com. It should be the SmartConnect Zone DNS name for your PowerScale cluster. The directory portion "/backup/mydata" is relative to the HDFS root path defined in your PowerScale cluster access zone. If your HDFS root path is /ifs/hadoop, then this value refers to /ifs/hadoop/backup/mydata.
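The path mapping described above can be sketched as a small shell function. This is only an illustration: the function name onefs_path is hypothetical, and HDFS_ROOT is assumed to hold the HDFS root path of your access zone (/ifs/hadoop in this article).

```shell
#!/bin/sh
# Map an HDFS URI on the PowerScale cluster to the OneFS path it refers to.
# HDFS_ROOT is the access zone's HDFS root path (assumed value from this article).
HDFS_ROOT="/ifs/hadoop"

onefs_path() {
    # Strip the scheme and host ("hdfs://host/"), leaving the directory portion.
    dir="/${1#hdfs://*/}"
    printf '%s%s\n' "$HDFS_ROOT" "$dir"
}

onefs_path "hdfs://all-nc-s-hdfs/backup/mydata"
# prints /ifs/hadoop/backup/mydata
```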

Files whose sizes are identical on the source and target directories are assumed to be unchanged and are not copied. In particular, file timestamps are not used to determine changed files. For more details on DistCp, see the Hadoop DistCp Version 2 Guide.
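To limit the impact of a backup on other tasks, DistCp accepts the -m option (maximum number of simultaneous map tasks) and the -bandwidth option (MB per second per map task). The following sketch only prints such a command for review. The function name and both values are illustrative assumptions, not recommendations.

```shell
#!/bin/sh
# Build (but do not run) a throttled DistCp command.
# MAX_MAPS and MB_PER_MAP are illustrative values, not recommendations.
MAX_MAPS=20      # -m: upper bound on simultaneous map tasks
MB_PER_MAP=50    # -bandwidth: approximate MB/s allowed per map task

throttled_distcp_cmd() {
    # $1 = source path, $2 = target URI
    printf 'hadoop distcp -skipcrccheck -update -m %s -bandwidth %s %s %s\n' \
        "$MAX_MAPS" "$MB_PER_MAP" "$1" "$2"
}

# The aggregate ceiling is roughly maps x per-map bandwidth.
echo "Cap: about $((MAX_MAPS * MB_PER_MAP)) MB/s in aggregate"
throttled_distcp_cmd /mydata hdfs://all-nc-s-hdfs/backup/mydata
```

Start with conservative values and raise them while monitoring the source cluster, because the appropriate limits depend on your network and workload.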

COPYING PERMISSIONS
By default, the owner, group, and permissions of the target files are reset to the default for new files created by the user initiating DistCp. Any owner, group, and permissions defined for the source file are lost. To retain this information from the source files, use the -p option. The letters that follow -p select what to preserve: u (user), g (group), and p (permissions). Because the -p option must perform chown/chgrp, the user initiating DistCp must be a superuser on the target system. The root user on the PowerScale cluster works for this purpose. For example:

[root@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update -pugp /mydata hdfs://all-nc-s-hdfs/backup/mydata

USING SNAPSHOTS FOR YOUR BACKUP SOURCE
Backing up a large dataset may take a long time. A file that exists when DistCp scans the directory structure at the start of the job may no longer exist by the time DistCp tries to copy it, which produces errors. In addition, an application may require a consistent, single point-in-time backup for the backup to be usable. To address both issues, create an HDFS snapshot of your source so that the dataset does not change during the backup process. HDFS snapshots are unrelated to the SnapshotIQ feature of your target PowerScale cluster.

To use HDFS snapshots, you must first allow snapshots for a particular directory:

[gpadmin@phddas2-0 ~]$ hdfs dfsadmin -allowSnapshot /mydata
Allowing snapshot on /mydata succeeded

Immediately before a backup with DistCp, create the HDFS snapshot:

[gpadmin@phddas2-0 ~]$ hdfs dfs -createSnapshot /mydata backupsnap
Created snapshot /mydata/.snapshot/backupsnap

The name of this snapshot is backupsnap. You can access it at the HDFS path /mydata/.snapshot/backupsnap. Any changes to your HDFS files after this snapshot are not reflected in the subsequent backup. You can back up the snapshot to PowerScale using the following command:

[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update /mydata/.snapshot/backupsnap hdfs://all-nc-s-hdfs/backup/mydata

When the backup command finishes running, you can delete the snapshot. Doing so frees up any space used to hold older versions of files that were modified since the snapshot:

[gpadmin@phddas2-0 ~]$ hdfs dfs -deleteSnapshot /mydata backupsnap
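The create, copy, and delete steps above can be combined into one script. The following is a minimal sketch, assuming snapshots are already allowed on /mydata. The run helper and the DRY_RUN variable are hypothetical conveniences: with DRY_RUN=1 (the default here) the script only prints each command.

```shell
#!/bin/sh
# Snapshot-consistent backup: create the snapshot, copy it, then delete it.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
SRC="/mydata"
SNAP="backupsnap"
TARGET="hdfs://all-nc-s-hdfs/backup/mydata"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

snapshot_backup() {
    run hdfs dfs -createSnapshot "$SRC" "$SNAP" || return 1
    # Copy from the read-only snapshot path, not the live directory.
    run hadoop distcp -skipcrccheck -update "$SRC/.snapshot/$SNAP" "$TARGET"
    status=$?
    # Delete the snapshot even if the copy failed, then report the copy status.
    run hdfs dfs -deleteSnapshot "$SRC" "$SNAP"
    return "$status"
}

snapshot_backup
```

Deleting the snapshot after a failed copy is a design choice in this sketch. It keeps the fixed snapshot name free for the next run, at the cost of losing the point-in-time view that the failed job used.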

USING PowerScale SNAPSHOTS FOR YOUR BACKUP TARGET
Independent from using snapshots for your backup source, you may want to keep multiple snapshots of your backup target directory to restore older versions of files.

To create snapshots on PowerScale, you must have a SnapshotIQ license. You can create snapshots using the web admin interface or CLI. To create a single PowerScale snapshot manually with the CLI, SSH into any PowerScale node and run the following:

all-nc-s-1# isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-2014-07-01 --expires 1D --verbose
Created snapshot backup-2014-07-01 with ID 6

You can add this command to the backup process discussed in the Scheduling Backups section below.

For more details regarding PowerScale OneFS snapshots, see the PowerScale OneFS CLI Administration Guide for your version of OneFS: PowerScale OneFS Info Hubs

SYNCIQ REPLICATION FOR MULTIPLE PowerScale CLUSTERS
After the DistCp backup to the PowerScale cluster completes, you can use OneFS SyncIQ to replicate snapshots across a WAN to other PowerScale clusters. Replicated snapshots can provide a versatile and efficient component of your disaster recovery strategy.

Figure 2: SyncIQ Replication for multiple Isilon clusters

HANDLING DELETED FILES
By default, files deleted from the source Hadoop cluster are not deleted from the target PowerScale cluster. If you require this behavior, add the -delete argument to the DistCp command. When using this argument, it is recommended to use snapshots on the backup target to allow for the recovery of deleted files.
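As an illustration, the sketch below prints a DistCp command that mirrors deletions to the target. DistCp accepts -delete only together with -update or -overwrite, so -update is kept. The function name is hypothetical.

```shell
#!/bin/sh
# Build a mirror-style DistCp command: files that no longer exist on the
# source are also removed from the target.
mirror_distcp_cmd() {
    # $1 = source path, $2 = target URI
    printf 'hadoop distcp -skipcrccheck -update -delete %s %s\n' "$1" "$2"
}

mirror_distcp_cmd /mydata hdfs://all-nc-s-hdfs/backup/mydata
```

Review the printed command and confirm that the source and target paths are correct before running it, because -delete removes data from the target.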

SCHEDULING BACKUPS
You can automate and schedule the steps to back up a Hadoop cluster using various methods. Apache Oozie is often used to automate Hadoop tasks, and it directly supports DistCp. You can also use cron to run a shell script. To automate commands that run in an SSH session, enable password-less SSH so that the cron user can connect to your Hadoop client and, if you use SnapshotIQ, to your PowerScale cluster.
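A cron-driven wrapper might look like the following sketch. The script path, log path, schedule, and the SSH login root@all-nc-s-1 are assumptions for illustration, and password-less SSH to the PowerScale node is presumed to be configured. With DRY_RUN=1 (the default here) the script only prints the commands.

```shell
#!/bin/sh
# Nightly backup wrapper intended to be called from cron.
# Hypothetical crontab entry (runs at 01:00 every day):
#   0 1 * * * /usr/local/bin/hadoop-backup.sh >> /var/log/hadoop-backup.log 2>&1
DRY_RUN="${DRY_RUN:-1}"
ISILON_NODE="root@all-nc-s-1"   # assumed SSH login on a PowerScale node

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

nightly_backup() {
    # $1 = date stamp used in the snapshot name
    run hadoop distcp -skipcrccheck -update /mydata \
        hdfs://all-nc-s-hdfs/backup/mydata || return 1
    # Snapshot the backup target only after the copy succeeded.
    run ssh "$ISILON_NODE" isi snapshot snapshots create \
        /ifs/hadoop/backup/mydata --name "backup-$1" --expires 1D
}

nightly_backup "$(date +%Y-%m-%d)"
```

Taking the PowerScale snapshot only after a successful copy means that each snapshot corresponds to a completed backup. Adjust the --expires value to match your retention requirements.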

RECOVERY METHODS

REVERSE DISTCP

The standard method to restore a DistCp backup from PowerScale to a traditional Hadoop infrastructure is to run DistCp in the reverse direction. Do this by swapping the source and target paths.

[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update hdfs://all-nc-s-hdfs/backup/mydata /mydata

You may want to create a snapshot of the target directory so that you can undo any mistakes made during the recovery process. However, be aware of the additional disk usage required to maintain snapshots.

DIRECT ACCESS TO BACKUP DATA USING HDFS

The backup target files on PowerScale are accessible from Hadoop applications in the same way as the source files, due to PowerScale’s support for HDFS. You can use your backup data directly, without having to first restore it to your original source Hadoop environment. This capability saves analysis time. For example, suppose that you normally run a MapReduce command like this against the source cluster:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep /mydata/mydataset1 output1 ABC

You can run the MapReduce job against the backup dataset on PowerScale using the following command:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep hdfs://all-nc-s-hdfs/backup/mydata/mydataset1 output1 ABC

For details on how to specify a fully qualified Hadoop path in your application instead of relying on the fs.defaultFS parameter, check with your application provider. Also, a PowerScale cluster that is designed for backup and archive, rather than for high performance, is unlikely to provide the same performance as your primary Hadoop environment. Testing is recommended, or consult with Dell PowerScale for proper sizing.

RECOVERY FROM PowerScale SNAPSHOTS

You can recover files from a previous PowerScale snapshot. The files are available in the /ifs/.snapshot directory. For details and other options, see the PowerScale OneFS CLI Administration Guide.
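The sketch below builds the path of a file inside a snapshot. It assumes that each snapshot appears under /ifs/.snapshot/<snapshot-name>/ followed by the file's path relative to /ifs. Verify this layout against the administration guide for your OneFS version. The function name and the file name file1 are hypothetical.

```shell
#!/bin/sh
# Build the path of a file as it appears inside a named OneFS snapshot.
# Assumed layout: /ifs/.snapshot/<snapshot-name>/<path relative to /ifs>
snapshot_path() {
    # $1 = snapshot name, $2 = absolute path under /ifs
    printf '/ifs/.snapshot/%s/%s\n' "$1" "${2#/ifs/}"
}

snapshot_path backup-2014-07-01 /ifs/hadoop/backup/mydata/file1
# prints /ifs/.snapshot/backup-2014-07-01/hadoop/backup/mydata/file1
```

You could then copy the file from the printed path back to its original location, for example with cp on a PowerScale node.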

HDFS VERSION COMPATIBILITY
PowerScale is compatible with multiple versions of HDFS. You can use them simultaneously to access the same dataset. PowerScale can automatically detect the appropriate HDFS version per connection without any configuration. See the PowerScale OneFS CLI Administration Guide for the list of supported Hadoop distributions and versions or go to Hadoop Distributions and Products Supported by OneFS. Version compatibility means that multiple Hadoop environments running different versions of Hadoop can back up to a single PowerScale cluster using HDFS.

Where PowerScale does not support your Hadoop version, you can still use DistCp to back up and restore your Hadoop data with PowerScale using HFTP. For instance, PHD 2.0 and later are not supported on PowerScale OneFS 7.1.1 and earlier. In this configuration, you must build a small Hadoop cluster using a version of Hadoop that PowerScale directly supports. Once the Hadoop cluster is built, run DistCp on this new cluster using the HFTP protocol to access your source data on your original Hadoop cluster. HFTP is a read-only file system that is compatible across different versions of Hadoop. For example:

[gpadmin@phddas2-0 ~]$ hadoop distcp -skipcrccheck -update hftp://phddas2-namenode-0/mydata hdfs://all-nc-s-hdfs/backup/mydata

The size of the new small cluster that runs the DistCp MapReduce job depends primarily on the throughput that is required. If you only need to back up at a rate of 10 Gbps, a single Hadoop node is sufficient. None of your data is stored on this small Hadoop cluster, so disk requirements are minimal.
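To judge whether a given rate is sufficient, you can estimate the copy time from the dataset size. The sketch below assumes an ideal, sustained line rate with no protocol overhead, and the 100 TB dataset size is hypothetical.

```shell
#!/bin/sh
# Estimate copy time at a sustained network rate (ideal case, no overhead).
backup_hours() {
    # $1 = dataset size in TB (decimal), $2 = link rate in Gbps
    # TB -> gigabits: x 1000 GB/TB x 8 bits/byte; divide by Gbps for seconds.
    awk -v tb="$1" -v gbps="$2" \
        'BEGIN { printf "%.1f\n", tb * 1000 * 8 / gbps / 3600 }'
}

backup_hours 100 10
# prints 22.2
```

At 10 Gbps, a 100 TB full copy therefore takes roughly 22 hours at best. Later runs with -update copy only new or changed files and usually finish faster.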

Figure 3: Backup a Hadoop cluster to Isilon with incompatible protocol versions

CONCLUSION

Dell PowerScale is a great platform for Hadoop and other Big Data applications. It uses erasure coding to protect data with greater than 80% storage efficiency, in contrast to traditional HDFS with 33% storage efficiency. Dell PowerScale has several classes of node types, from the dense NL400, to the high-performance S210, and the X410 in between. The different node types allow you to optimize different PowerScale tiers for particular workloads. The backup of traditional Hadoop environments to PowerScale is easy to do and allows for the densest usable HDFS backup target.
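The storage efficiency figures translate into raw capacity as follows. This sketch assumes the default HDFS replication factor of 3 (about 33% efficiency) and uses 80% as the lower bound for PowerScale. The 100 TB dataset size and the function names are hypothetical.

```shell
#!/bin/sh
# Raw capacity needed to hold a given amount of usable data.
raw_replicated() {
    # $1 = usable TB; traditional HDFS keeps 3 full copies of every block.
    echo $(( $1 * 3 ))
}

raw_at_efficiency() {
    # $1 = usable TB, $2 = storage efficiency in percent (integer division).
    echo $(( $1 * 100 / $2 ))
}

echo "3x replication: $(raw_replicated 100) TB raw"
echo "80% efficiency: $(raw_at_efficiency 100 80) TB raw"
# prints 300 TB raw and 125 TB raw respectively
```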
