Data Transfer from non-Isilon Cluster to Isilon Cluster

Question

Hey Community, here is the scenario: we have a Source (None Isilon Cluster) and a Destination (Isilon Cluster). We are trying to copy nearly 1 PB of data from the Source to the Destination cluster. We would like to find the right command/script to maximize throughput from the Source to the Destination. Knowing that Isilon has numerous heads that users and systems can write to directly, I was wondering if anyone has a command/script that can be run on a Linux server (Source) to perform this copy to the Isilon cluster (Destination): something multi-threaded that spreads the load across multiple nodes via their IP addresses, so that we don't overload one node on the Destination cluster (Isilon). Thank you,

carlilek · Answer

Look into GNU parallel + rsync. Here's what I typically do:

cd /source/cluster/path

ls | parallel -q rsync -vaWh --inplace {} /target/cluster/path/

If you're only using one client, you won't oversaturate a node, so I wouldn't bother with the IP addresses. If you're using multiple clients, sure.
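
If you do want to spread the writes across several Isilon nodes, one possible sketch is below. It assumes (this is not from the thread) that the same Isilon export is mounted on the Linux host several times, each mount going through a different node IP, at hypothetical mount points such as /mnt/isilon1, /mnt/isilon2 and /mnt/isilon3. The helper name assign_targets is made up for illustration; it only pairs each directory with a mount point in round-robin order, so that parallel can hand each rsync a different node. Directory names containing tabs or newlines, and mount paths containing spaces, are not handled.

```shell
# Pair each input line (a directory name) with one of the given target
# mount points, cycling through them in round-robin order.
# Output format: <name><TAB><target>
assign_targets() {
    awk -v targets="$*" '
        BEGIN { n = split(targets, t, " ") }
        { printf "%s\t%s\n", $0, t[(NR - 1) % n + 1] }
    '
}

# Hypothetical usage (mount points are placeholders):
#   cd /source/cluster/path
#   ls | assign_targets /mnt/isilon1 /mnt/isilon2 /mnt/isilon3 |
#       parallel --colsep '\t' rsync -vaWh --inplace {1} {2}/
```

With a single client this may not buy much, for the reason given above; it is more likely to matter once several source hosts are copying at the same time.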

Anonymous User · Answer

This is what SyncIQ is for - it will parallelize the I/O across source and destination nodes.  We use it a LOT and it works well.

Peter_Sero · Answer

Not quite -- if one reads "None Isilon" as "non-Isilon"... scnr -- Peter

Anonymous User · Answer

That makes a difference. I read it as "One", Peter read it as "none".

Yes, parallel rsync is your friend from a generic Linux host to Isilon. The more hosts and threads you can throw at it, the better.

Ideally, you would have a good look at the data and decide which directories to parallelize at which level. Always start the largest directories first, sorting down to the smallest. As the smaller ones finish, other even smaller ones will start up. If you don't do this, Murphy's Law says you'll do all of the smaller directories first, and when you're down to your last thread, the really large one will start up.

I have not found an easy way to parallelize the transfers without researching the source data first. You need to factor in both the file count and file sizes to balance the threads properly.
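
A rough way to get the largest-first ordering is to let du measure each top-level directory and sort on the result. This is only a sketch: largest_first is a made-up helper name, it looks at size only (not file count, which matters too, as noted above), it skips hidden directories, and it walks the whole tree once before any copying starts, which can itself take a long time on a source of this size.

```shell
# Print the immediate subdirectories of directory $1, largest first.
# du -sk prints "<kilobytes><TAB><path>"; sort -rn orders by size, descending.
# The trailing slash left by the */ glob is stripped, because rsync treats
# "dir/" (copy the contents) differently from "dir" (copy the directory).
largest_first() {
    ( cd "$1" && du -sk -- */ 2>/dev/null | sort -rn | cut -f2- | sed 's:/$::' )
}

# Hypothetical usage (paths are placeholders, -j 8 is an arbitrary job count):
#   cd /source/cluster/path
#   largest_first . | parallel -j 8 rsync -aWh --inplace {} /target/cluster/path/
```

parallel starts jobs in the order it reads them, so the biggest directories get a thread first and the small ones fill in behind them.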

Watch the encryption you're doing as well. Don't use the most secure protocols if you don't need them. If you must tunnel over ssh but stay within your own secure environment, drop down to a lighter cipher (blowfish, where your ssh build still offers it). You can make this change in ~/.ssh/config if that's easier for you.
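
For the ~/.ssh/config route, a fragment along these lines would pin a cheaper cipher for one host. The host alias target-host is a placeholder, and the cipher names are an assumption about what a reasonable choice looks like, not something from this thread. Blowfish itself was removed from OpenSSH in release 7.6, so on a current client you would pick from whatever `ssh -Q cipher` lists.

```
# ~/.ssh/config (hypothetical host alias; adjust the cipher list to your build)
Host target-host
    Ciphers aes128-gcm@openssh.com,aes128-ctr
```

This only matters when rsync actually runs over ssh; copying to an NFS mount of the cluster involves no ssh encryption at all.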

It should go without saying that you should use a current version of rsync.

chjatwork · Answer

Oops... thanks, I've edited that typo.
