Like many who work with NGS data, I have a large storage server and a cluster for computation. However, I am asked not to leave all my files on the cluster, as it is shared and its storage, while large, is limited. The link between storage and computation runs at up to 100 MB/s, and I sometimes have to copy around a terabyte of data. I am using rsync, but at this file size it has some limitations.
Mainly, when I want to "sync" two directories, even if many files are already present in both source and destination, comparing the two takes considerable time. If the connection falls just before the end of a transfer, re-syncing can take up to half as long as the original transfer.
What do you use to move around very large files? Is there a way to tell rsync to locally store some sort of label for successfully transferred files (rather than checking that the two files are the same bit by bit or block by block) and skip them straight away?
rsync is otherwise a very nice tool: it works over ssh, it lets me cap the bandwidth (to avoid saturating the link), and it has other nice options.
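One workaround along the "label" idea above (a sketch, not a built-in rsync feature; all paths here are throwaway local placeholders): keep a local manifest of files you already know were transferred and feed it to rsync's --exclude-from, so those files are skipped outright without any comparison.

```shell
# Demonstrated locally; in practice the destination would be user@host:/path.
mkdir -p /tmp/mf_src /tmp/mf_dst
echo one > /tmp/mf_src/a.bam
echo two > /tmp/mf_src/b.bam

# Manifest of files we already know made it across (here, a.bam).
echo 'a.bam' > /tmp/mf_done.txt

# --exclude-from skips everything listed in the manifest outright,
# so rsync never even compares those files against the destination.
rsync --recursive --exclude-from=/tmp/mf_done.txt /tmp/mf_src/ /tmp/mf_dst/

ls /tmp/mf_dst    # only b.bam was copied
```

The manifest has to be maintained by hand (or by a wrapper script that appends names after each successful run), so it trades rsync's safety checks for speed.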
time rsync --verbose --progress --stats --rsh="/usr/bin/ssh -c arcfour" --recursive source.bam $mac:/tmp
building file list ...
1 file to consider
source.bam
  1017742131 100%   67.02MB/s    0:00:14 (xfer#1, to-check=0/1)

Number of files: 1
Number of files transferred: 1
Total file size: 1017742131 bytes
Total transferred file size: 1017742131 bytes
Literal data: 1017742131 bytes
Matched data: 0 bytes
File list size: 45
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1017866474
Total bytes received: 42

sent 1017866474 bytes  received 42 bytes  61688879.76 bytes/sec
total size is 1017742131  speedup is 1.00

real    0m15.624s
user    0m9.822s
sys     0m4.545s

time rsync --verbose --progress --stats --rsh="/usr/bin/ssh -c arcfour" --recursive source.bam destination:/tmp
building file list ...
1 file to consider
source.bam
  1017742131 100%  171.02MB/s    0:00:05 (xfer#1, to-check=0/1)

Number of files: 1
Number of files transferred: 1
Total file size: 1017742131 bytes
Total transferred file size: 1017742131 bytes
Literal data: 0 bytes
Matched data: 1017742131 bytes
File list size: 45
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 127739
Total bytes received: 223405

sent 127739 bytes  received 223405 bytes  16332.28 bytes/sec
total size is 1017742131  speedup is 2898.36

real    0m21.620s
user    0m4.780s
sys     0m0.846s
Sorry, I read the help as "skip [check]" instead of "skip [files]". How about --size-only: "skip files that match in size"?
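To illustrate what --size-only does (a local sketch with throwaway paths): a destination file whose size already matches the source is skipped outright, even when its content differs, which is exactly what makes it fast, and slightly risky.

```shell
mkdir -p /tmp/so_src /tmp/so_dst
printf 'AAAA' > /tmp/so_src/f.txt    # 4 bytes in the source
printf 'BBBB' > /tmp/so_dst/f.txt    # 4 bytes, different content, at the destination

# --size-only: files matching in size are skipped with no
# mod-time or checksum comparison at all.
rsync --size-only /tmp/so_src/f.txt /tmp/so_dst/f.txt

cat /tmp/so_dst/f.txt   # still "BBBB": the file was never re-sent
```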
Have you tried the "-c" option: "skip based on checksum, not mod-time & size"?
Thanks, but from the man page: "This forces the sender to checksum every regular file using a 128-bit MD4 checksum." So I definitely DO NOT want this. I think rsync does something more than just checking mod-time & size...
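For what it's worth, rsync's default quick check is exactly size plus mod-time, which can be seen locally (a sketch with throwaway paths): a destination file whose size and mtime both match the source is left alone even when its content differs.

```shell
mkdir -p /tmp/qc_src /tmp/qc_dst
printf 'AAAA' > /tmp/qc_src/f.txt
printf 'BBBB' > /tmp/qc_dst/f.txt
# Give both files the same mtime so the quick check passes.
touch -t 202001010000 /tmp/qc_src/f.txt /tmp/qc_dst/f.txt

# Default quick check: same size and mtime -> file skipped,
# content never compared (that is what -c would change).
rsync /tmp/qc_src/f.txt /tmp/qc_dst/f.txt

cat /tmp/qc_dst/f.txt   # still "BBBB"
```

The delta-transfer ("block by block") machinery only kicks in for files that fail this quick check, so the slow part on a re-sync of mostly identical trees is usually stat-ing the file list, not checksumming.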
What about mounting a folder from the storage server on the cluster?
@Giovanni: Not really an option, and out of my "power". There is also a technical reason: the storage attached to the computing cluster has very high performance (3.2 GB/s) so that it does not become the bottleneck; files have to be moved there first.