20 Posts

September 12th, 2014 12:00

I am also trying to find an answer to this.

99 Posts

September 12th, 2014 14:00

A couple things here:

1) You can use SyncIQ to copy files locally from one directory to another.  It's very efficient.  Try it, you'll like it :-)

2) If you want to use cp -c (clone files), you can call cp -c via xargs to clone a whole bunch of files into a given directory in one command instead of doing them one at a time.  The downside is that only one node is involved in the cp process.  SyncIQ uses all nodes, with multiple workers, to move files.
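To sketch option 2 (the paths and the small sh wrapper are my own assumptions; cp -c is the OneFS/BSD clone flag):

```shell
# Clone every regular file in /ifs/data/src into /ifs/data/dst.
# -print0 / -0 keep filenames with spaces safe, and the sh -c
# wrapper lets xargs hand many files to a single cp invocation.
find /ifs/data/src -maxdepth 1 -type f -print0 \
  | xargs -0 sh -c 'cp -c "$@" /ifs/data/dst/' cp-batch
```

The trailing `cp-batch` just fills `$0` for the inner shell so that `"$@"` holds only the file names.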

Have fun!

1 Rookie

 • 

21 Posts

September 13th, 2014 00:00

Is there no way of using all nodes in the copy process, besides using SyncIQ? As far as I can tell, SyncIQ is a licensed feature, which we will never use, and hence have not purchased. I could perhaps do the file copy over FTP to the external interface, and hence get all nodes working at once. I will have to investigate this.

4 Operator

 • 

1.2K Posts

September 13th, 2014 02:00

Why not get a 30-day SyncIQ evaluation license from your sales rep?

On the other hand --- copying / cloning / linking -- what do you actually want to accomplish: more data redundancy, or multiple access paths (permissions...) to the same data, or... ? Cloning doesn't produce additional data redundancy; linking will provide new access paths, but with identical permissions on the file level (possibly different permissions on the paths).

-- Peter

99 Posts

September 13th, 2014 04:00

The eval license is key - pun intended.

Anyway - yes, there is a way to distribute cp work.  Think about it.  You can easily script the copying of subdirectories and distribute that work across the nodes.  isi_for_array is your friend here.
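A rough sketch of that idea (untested; the hostname-suffix node numbering, the 4-node count, and the paths are all assumptions - adapt them to your cluster):

```shell
# Run on every node via isi_for_array; each node takes every 4th
# top-level subdirectory of /ifs/src (4-node cluster assumed),
# keyed on the node number at the end of its hostname, e.g. cluster-2.
isi_for_array 'n=${HOSTNAME##*-}; \
  ls -d /ifs/src/*/ | awk -v n="$n" "NR % 4 == n % 4" | \
  while read -r d; do cp -R "$d" /ifs/dst/; done'
```

The awk filter is what spreads the work: node n only sees the directories whose line number matches its slot.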

But I would opine that you (spengler) indeed have use for SyncIQ.  Your use case - parallel copying of files into directories - is front and center.  SIQ is a hugely useful capability and is well worth the license fee.  Time is money.  Work with your account team; everything is negotiable!

1 Rookie

 • 

21 Posts

September 13th, 2014 07:00

I know my case is a bit odd. It makes sense if you know the bigger picture. All I can say is that the files need to be located in two different locations for a short time. The files cannot be hardlinked, but a clone would work. Later the files will be replaced by other files, but for a short period they will remain as copies.

I will not have any other use for the SyncIQ feature in the future. I could get a 30-day license, I guess, but for now I will have to make do with the rsync-from-one-host option. It works, although I had hoped it could also be distributed across all hosts in some way.

1 Rookie

 • 

21 Posts

September 13th, 2014 08:00

The final picture is this: I will have an original file in one location, and a modified version of the file in another location. But during the migration I have chosen to keep the original files in both locations to save time. Later the files in the other location will be replaced with the modified versions, but since the modifications take time, the originals will be used in the meantime. Hence: I need the same files in two folders.

September 13th, 2014 12:00

There are many approaches to making transfers suck less, but they all have tradeoffs.  Optimizing performance means you need to know how your data is structured in your directory trees and how the size of the files is balanced against the number of files.  The trick is to parallelize the copies until your CPU or network tips over.  Isilon also makes this challenging because ideally you want as many nodes as possible participating in the work (which is why SyncIQ is a nice tool for this job).

The IB interfaces aren't the fastest, and if you can use the external 10Gbps interfaces, you'll get better speeds.

I typically run my tools from a Linux system that has 10Gbps connectivity to the Isilon arrays via NFS, and on which I feel more comfortable installing tools like GNU Parallel rather than installing them on the array itself.

Assuming GNU Parallel is in your path, try something like this:

# find /source/ -maxdepth 1 -mindepth 1 -type d | \
    parallel -v --jobs 4 \
    rsync -a --delete {}/ /destination/{/}/

This will run one rsync stream for every directory in /source and start 4 streams at once.  As one finishes, another will start.  You can mount /source and /destination from different Isilon nodes to help balance the load.  If you have a large cluster, mounting each directory under /source and /destination separately from different nodes might also help.

You should fine-tune the find command based on your directory structure.  For example, you may want to drop down a directory level or two - it's impossible for me to tell you where in the path you want to start your parallelism.
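For instance, to start the parallelism one level deeper (assuming a /source/project/subdir layout - the paths are illustrative), rsync's --relative flag rebuilds the intermediate directories on the destination side:

```shell
# One rsync stream per second-level directory; --relative recreates
# the project/subdir path under /destination automatically.
cd /source && find . -mindepth 2 -maxdepth 2 -type d | \
  parallel -v --jobs 8 \
    rsync -a --relative --delete {}/ /destination/
```

More, smaller streams keep the job queue busy when directory sizes are uneven, at the cost of more rsync startup overhead.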

1 Rookie

 • 

21 Posts

September 13th, 2014 13:00

Thank you for the input. I also run my tools on a Linux client connected through a number of 10G interfaces. I will look at this workaround. Once again, thanks.

1 Rookie

 • 

21 Posts

September 13th, 2014 13:00

Yes, the problem with the links is definitely that I need separate permissions on the "copies" folder. Hence I need the extra copy of the files, which makes a lot more sense once the files are no longer identical.

4 Operator

 • 

1.2K Posts

September 28th, 2014 23:00

Claudio, cool idea!  I can't access the internal document - does it discuss the situation where something like "run as root" would be needed to keep file permissions etc. intact?

-- Peter

179 Posts

November 10th, 2014 14:00

Hi Peter,

Claudio published his blog on ECN last week, in case you want to have a look.

Backing Up Hadoop To Isilon
