na75369
20 Posts
0
September 12th, 2014 12:00
I am also trying to find the answer to this.
peglarr
99 Posts
0
September 12th, 2014 14:00
A couple things here:
1) You can use SyncIQ to copy files locally from one directory to another. It's very efficient. Try it, you'll like it :-)
2) If you want to use cp -c (clone files) you can call cp -c using xargs to clone a whole bunch of files to a given directory all in one command instead of having to do one at a time. The downside is only one node is involved in the cp process. SyncIQ uses all nodes with multiple workers to move files.
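For option (2), a minimal sketch of the xargs pattern, demoed with plain `cp` on throwaway paths (paths are mine, not from the thread). On OneFS you would substitute `cp -c` to create clones; note the OneFS `cp` is BSD-style and lacks GNU's `-t`, so the per-file form in the comment applies there:

```shell
# demo tree (placeholder paths, for illustration only)
rm -rf /tmp/csrc /tmp/cdst
mkdir -p /tmp/csrc /tmp/cdst
echo a > /tmp/csrc/a.txt
echo b > /tmp/csrc/b.txt

# batch all files into as few cp invocations as possible (GNU cp -t);
# on OneFS, whose BSD cp has no -t, use the per-file form instead:
#   find /ifs/src -maxdepth 1 -type f -print0 | xargs -0 -I{} cp -c {} /ifs/dst/
find /tmp/csrc -maxdepth 1 -type f -print0 | xargs -0 cp -t /tmp/cdst/
```

Using `-print0`/`-0` keeps filenames with spaces or newlines from breaking the pipeline.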
Have fun!
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 00:00
Is there a way to use all nodes in the copy process, besides SyncIQ? As far as I can tell, SyncIQ is a licensed feature, which we will never use and hence have not purchased. I could perhaps do the file copy over FTP to the external interface, and thereby get all nodes working at once. I will have to investigate this.
Peter_Sero
4 Operator
•
1.2K Posts
0
September 13th, 2014 02:00
Why not get a 30-day SyncIQ evaluation license from your sales rep?
On the other hand, copying / cloning / linking: what do you actually want to accomplish? More data redundancy, or multiple access paths (permissions...) to the same data, or something else? Cloning doesn't produce additional data redundancy; linking will provide new access paths, but with identical permissions at the file level (possibly different permissions on the paths).
-- Peter
peglarr
99 Posts
0
September 13th, 2014 04:00
The eval license is key, pun intended
Anyway - yes, there is a way to distribute cp work. Think about it. You can easily script the copying of subdirectories and distribute that work to each node. isi_for_array is your friend here.
But I would opine that you (spengler) indeed have use for SyncIQ. Your use case - parallel copying of files into directories - is front and center. SIQ is a hugely useful capability and is well worth the license fee. Time is money. Work with your account team; everything is negotiable!
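One way to sketch that distribution (hypothetical: the paths, node count, and hash-based split are illustrative, not OneFS features): run the same script on every node via isi_for_array and let each node claim the subdirectories whose name hashes to its node number. A self-contained demo that simulates three "nodes" locally:

```shell
# copy_for_node NODE TOTAL: copy only the subdirectories this node "owns",
# chosen by hashing the directory name (cksum) modulo the node count
copy_for_node() {
  node=$1; total=$2
  for d in /tmp/dsrc/*/; do
    name=$(basename "$d")
    h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    if [ $((h % total)) -eq $((node - 1)) ]; then
      cp -R "$d" "/tmp/ddst/$name"
    fi
  done
}

# demo data (placeholder paths)
rm -rf /tmp/dsrc /tmp/ddst
mkdir -p /tmp/dsrc/alpha /tmp/dsrc/beta /tmp/dsrc/gamma /tmp/ddst

# simulate three nodes; on a cluster each node would run only its own slice
for n in 1 2 3; do copy_for_node "$n" 3; done
```

Because each name hashes to exactly one node, every subdirectory is copied exactly once and no coordination between nodes is needed.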
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 07:00
I know my case is a bit odd; it makes sense if you know the bigger picture. All I can say is that the files need to be located in two different locations for a short time. The files cannot be hardlinked, but a clone would work. Later the files will be replaced by other files, but for a short period they will remain as copies.
I will not have any other use for SyncIQ in the future. I could get a 30-day license, I guess, but for now I will have to make do with the rsync-from-one-host option. It works, although I had hoped the copy could also be distributed across all hosts in some way.
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 08:00
The final picture is this: I will have an original file in one location and a modified version of the file in another location. During the migration I have chosen to have the original files in both locations to save time. Later, the files in the second location will be replaced with the modified versions, but since the modifications take time, the originals will be used in the meantime. Hence: I need the same files in two folders.
Anonymous User
170 Posts
1
September 13th, 2014 12:00
There are many approaches to making transfers suck less, but they all have tradeoffs. Optimizing performance means knowing how your data is structured in your directory trees and how the size of the files balances against the number of files. The trick is to parallelize the copies until your CPU or network tips over. Isilon also makes this challenging because ideally you want as many nodes as possible participating in the work (that's why SyncIQ is a nice tool for this job).
The IB interfaces aren't the fastest and if you can use the external 10Gbps interfaces, you'll get better speeds.
I typically run my tools from a Linux system that has 10Gbps connectivity to the Isilon arrays via NFS, and on which I feel more comfortable installing tools like GNU Parallel rather than installing them on the array itself.
Assuming GNU Parallel is in your path, try something like this:
find /source/ -maxdepth 1 -mindepth 1 -type d | \
  parallel -v --jobs 4 \
  rsync -a --delete {}/ /destination/{/}/
This will run one rsync stream for every directory in /source, four streams at a time. As one finishes, another starts. You can mount /source and /destination from different Isilon nodes to help balance the load. If you have a large cluster, mounting each directory under /source and /destination separately on different nodes might also help.
You should fine-tune the find command to your directory structure. For example, you may want to drop down a directory level or two; it's impossible for me to tell you where in the path you want to start your parallelism.
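If GNU Parallel isn't installed on the client, `xargs -P` gives a similar fan-out. A self-contained sketch on throwaway paths (paths are placeholders; a real run would swap `cp -R` for the `rsync -a --delete` invocation above, pointed at the NFS mounts):

```shell
# demo data (placeholder paths, for illustration only)
rm -rf /tmp/psrc /tmp/pdst
mkdir -p /tmp/psrc/d1 /tmp/psrc/d2 /tmp/psrc/d3 /tmp/pdst
echo hi > /tmp/psrc/d1/f1
echo hi > /tmp/psrc/d2/f2
echo hi > /tmp/psrc/d3/f3

# one copy stream per top-level directory, at most 4 running at once;
# swap `cp -R` for `rsync -a --delete` against real mounts
find /tmp/psrc -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -P4 -I{} cp -R {} /tmp/pdst/
```

The `-P4` cap plays the same role as `--jobs 4` in GNU Parallel: as one stream finishes, xargs starts the next.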
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 13:00
Thank you for the input. I also run my tools on a Linux client connected by a number of 10G interfaces. I will look at this workaround. Once again, thanks.
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 13:00
Yes, the problem with the links is definitely that I need separate permissions on the "copies" folder. Hence I need the extra copy of the files, which makes a lot more sense once the files are no longer identical.
Peter_Sero
4 Operator
•
1.2K Posts
0
September 28th, 2014 23:00
Claudio, cool idea! I can't access the internal document; does it discuss situations where something like "run as root" would be needed to keep file permissions etc. intact?
-- Peter
Nikschen
179 Posts
1
November 10th, 2014 14:00
Hi Peter,
Claudio published his blog on ECN last week, in case you want to have a look.
Backing Up Hadoop To Isilon