na75369
20 Posts
0
September 12th, 2014 12:00
I am also trying to find the answer to this.
peglarr
99 Posts
0
September 12th, 2014 14:00
A couple things here:
1) You can use SyncIQ to copy files locally from one directory to another. It's very efficient. Try it, you'll like it :-)
2) If you want to use cp -c (clone files) you can call cp -c using xargs to clone a whole bunch of files to a given directory all in one command instead of having to do one at a time. The downside is only one node is involved in the cp process. SyncIQ uses all nodes with multiple workers to move files.
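For option (2), a minimal sketch of the xargs pattern, demoed with plain `cp` on throwaway paths (paths are mine, not from the thread). On OneFS you would substitute `cp -c` to create clones; note the OneFS `cp` is BSD-style and lacks GNU's `-t`, so the per-file form in the comment applies there:

```shell
# demo tree (placeholder paths, for illustration only)
rm -rf /tmp/csrc /tmp/cdst
mkdir -p /tmp/csrc /tmp/cdst
echo a > /tmp/csrc/a.txt
echo b > /tmp/csrc/b.txt

# batch all files into as few cp invocations as possible (GNU cp -t);
# on OneFS, whose BSD cp has no -t, use the per-file form instead:
#   find /ifs/src -maxdepth 1 -type f -print0 | xargs -0 -I{} cp -c {} /ifs/dst/
find /tmp/csrc -maxdepth 1 -type f -print0 | xargs -0 cp -t /tmp/cdst/
```

Using `-print0`/`-0` keeps filenames with spaces or newlines from breaking the pipeline.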
Have fun!
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 00:00
Is there a way to use all nodes in the copy process, besides SyncIQ? As far as I can tell, SyncIQ is a licensed feature, which we will never use and hence have not purchased. I could perhaps do the file copy over FTP to the external interface, and thereby get all nodes working at once. I will have to investigate this.
Peter_Sero
4 Operator
•
1.2K Posts
0
September 13th, 2014 02:00
Why not get a 30-day SyncIQ evaluation license from your sales rep?
On the other hand, copying / cloning / linking: what do you actually want to accomplish? More data redundancy, or multiple access paths (permissions...) to the same data, or something else? Cloning doesn't produce additional data redundancy; linking will provide new access paths, but with identical permissions at the file level (possibly different permissions on the paths).
-- Peter
peglarr
99 Posts
0
September 13th, 2014 04:00
The eval license is key, pun intended
Anyway - yes, there is a way to distribute cp work. Think about it. You can easily script the copying of subdirectories and distribute that work to each node. isi_for_array is your friend here.
But I would opine that you (spengler) indeed have use for SyncIQ. Your use case - parallel copying of files into directories - is front and center. SIQ is a hugely useful capability and is well worth the license fee. Time is money. Work with your account team; everything is negotiable!
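One way to sketch that distribution (hypothetical: the paths, node count, and hash-based split are illustrative, not OneFS features): run the same script on every node via isi_for_array and let each node claim the subdirectories whose name hashes to its node number. A self-contained demo that simulates three "nodes" locally:

```shell
# copy_for_node NODE TOTAL: copy only the subdirectories this node "owns",
# chosen by hashing the directory name (cksum) modulo the node count
copy_for_node() {
  node=$1; total=$2
  for d in /tmp/dsrc/*/; do
    name=$(basename "$d")
    h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    if [ $((h % total)) -eq $((node - 1)) ]; then
      cp -R "$d" "/tmp/ddst/$name"
    fi
  done
}

# demo data (placeholder paths)
rm -rf /tmp/dsrc /tmp/ddst
mkdir -p /tmp/dsrc/alpha /tmp/dsrc/beta /tmp/dsrc/gamma /tmp/ddst

# simulate three nodes; on a cluster each node would run only its own slice
for n in 1 2 3; do copy_for_node "$n" 3; done
```

Because each name hashes to exactly one node, every subdirectory is copied exactly once and no coordination between nodes is needed.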
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 07:00
I know my case is a bit odd; it makes sense if you know the bigger picture. All I can say is that the files need to be located in two different locations for a short time. The files cannot be hardlinked, but a clone would work. Later the files will be replaced by other files, but for a short period they will remain as copies.
I will not have any other use for SyncIQ in the future. I could get a 30-day license, I guess, but for now I will have to make do with the rsync-from-one-host option. It works, although I had hoped the copy could also be distributed across all hosts in some way.
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 08:00
The final picture is this: I will have an original file in one location and a modified version of the file in another location. During the migration I have chosen to have the original files in both locations to save time. Later, the files in the second location will be replaced with the modified versions, but since the modifications take time, the originals will be used in the meantime. Hence: I need the same files in two folders.
Anonymous User
170 Posts
1
September 13th, 2014 12:00
There are many approaches to making transfers suck less, but they all have tradeoffs. Optimizing performance means knowing how your data is structured in your directory trees and how the size of the files balances against the number of files. The trick is to parallelize the copies until your CPU or network tips over. Isilon also makes this challenging because ideally you want as many nodes as possible participating in the work (that's why SyncIQ is a nice tool for this job).
The IB interfaces aren't the fastest and if you can use the external 10Gbps interfaces, you'll get better speeds.
I typically run my tools from a Linux system that has 10Gbps connectivity to the Isilon arrays via NFS, and on which I feel more comfortable installing tools like GNU Parallel rather than installing them on the array itself.
Assuming GNU Parallel is in your path, try something like this:
find /source/ -maxdepth 1 -mindepth 1 -type d | \
  parallel -v --jobs 4 \
  rsync -a --delete {}/ /destination/{/}/
This will run one rsync stream for every directory in /source, four streams at a time. As one finishes, another starts. You can mount /source and /destination from different Isilon nodes to help balance the load. If you have a large cluster, mounting each directory under /source and /destination separately on different nodes might also help.
You should fine-tune the find command to your directory structure. For example, you may want to drop down a directory level or two; it's impossible for me to tell you where in the path you want to start your parallelism.
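If GNU Parallel isn't installed on the client, `xargs -P` gives a similar fan-out. A self-contained sketch on throwaway paths (paths are placeholders; a real run would swap `cp -R` for the `rsync -a --delete` invocation above, pointed at the NFS mounts):

```shell
# demo data (placeholder paths, for illustration only)
rm -rf /tmp/psrc /tmp/pdst
mkdir -p /tmp/psrc/d1 /tmp/psrc/d2 /tmp/psrc/d3 /tmp/pdst
echo hi > /tmp/psrc/d1/f1
echo hi > /tmp/psrc/d2/f2
echo hi > /tmp/psrc/d3/f3

# one copy stream per top-level directory, at most 4 running at once;
# swap `cp -R` for `rsync -a --delete` against real mounts
find /tmp/psrc -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -P4 -I{} cp -R {} /tmp/pdst/
```

The `-P4` cap plays the same role as `--jobs 4` in GNU Parallel: as one stream finishes, xargs starts the next.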
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 13:00
Thank you for the input. I also run my tools on a Linux client connected by a number of 10G interfaces. I will look at this workaround. Once again, thanks.
kim_stofa
1 Rookie
•
21 Posts
0
September 13th, 2014 13:00
Yes, the problem with the links is definitely that I need separate permissions on the "copies" folder. Hence I need the extra copy of the files, which makes a lot more sense once the files are no longer identical.
Peter_Sero
4 Operator
•
1.2K Posts
0
September 28th, 2014 23:00
Claudio, cool idea! I can't access the internal document; does it discuss situations where something like "run as root" would be needed to keep file permissions etc. intact?
-- Peter
Nikschen
179 Posts
1
November 10th, 2014 14:00
Hi Peter,
Claudio published his blog on ECN last week, in case you want to have a look.
Backing Up Hadoop To Isilon