Large File Transfers Of NGS Data: Rsync / Bbcp / Unison / What?
7
11
9.7 years ago
Dan Sheppard ▴ 110

I've read a few answers concerning the transfer of large data quickly and reliably across the internet, but I've not been able to find a single tool which combines all of the features below. Does anyone know if a combination of options in one of the commonly used tools achieves this?

• Multiple TCP streams or UDP for very fast transfer of bulk data
• Similarly sensible re disk writes, threads, poll/select and copying to get stuff onto disk quickly
• Checking of ownership and permissions as well as checksumming at both ends
• Handles multiple small files as efficiently as very large files
• Can run "rsync-style", only transferring diffs when appropriate
• Has good security: integrity and authenticity guarantees (secrecy not required)
• good quality linux server and client with robust error detection and reporting.
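The "checksumming at both ends" requirement above can be approximated with glue code whichever transfer tool is chosen: build a manifest of per-file digests on each side and compare them after the transfer. This is a minimal sketch, assuming Python 3 on both machines; `sha256_of`, `build_manifest` and `diff_manifests` are hypothetical names, not part of any tool mentioned in this thread.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def diff_manifests(source, destination):
    """Return (missing, corrupted) lists of paths, judged from the source side."""
    missing = sorted(p for p in source if p not in destination)
    corrupted = sorted(
        p for p in source if p in destination and source[p] != destination[p]
    )
    return missing, corrupted
```

In practice each side would run `build_manifest` locally and ship the (small) manifest to the other end, so a failed job shows up as a non-empty `missing` or `corrupted` list rather than being found later.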

I've had a look at rsync, fdt, bbcp, unison, aspera, udt/udr, &c, and all seem to offer a subset of these features?

Obviously, through a combination of tools and a load of glue scripts I could achieve this with existing tools, but before going to the effort, if it's just a magic combination of parameters, do let me know!

data • 19k views
4
9.7 years ago

As much as I hate to admit it, Aspera performs all of these functions except:

good quality linux server and client with robust error detection and reporting

The Linux client is hard to find (see update), and the command-line client provides only a basic level of documentation. What about the BitTorrent protocol? I haven't seen anyone using it for NGS data transfers, but regardless of the ethics surrounding its popularity, it's actually a great data transfer protocol: it supports private, encrypted transfers and UDP transfers, scales with file size and number of connections, performs strong checksumming, allows block-based partial transfers, and has several wonderful Linux clients. The only thing missing from your list would be checking ownership and permissions at each end.
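To illustrate the checksumming and block-based partial transfers mentioned here: BitTorrent (v1) splits the payload into fixed-size pieces and records a SHA-1 digest for each, so a receiver can verify pieces independently and re-request only the ones that fail. The sketch below works on in-memory bytes for brevity; the 256 KiB piece size and the function names are assumptions for illustration, not taken from any particular client.

```python
import hashlib


def piece_hashes(data, piece_size=256 * 1024):
    """SHA-1 digest of each fixed-size piece (the last piece may be shorter)."""
    return [
        hashlib.sha1(data[offset:offset + piece_size]).digest()
        for offset in range(0, len(data), piece_size)
    ]


def bad_pieces(received, expected_hashes, piece_size=256 * 1024):
    """Indices of pieces that are missing or fail verification."""
    actual = piece_hashes(received, piece_size)
    return [
        index
        for index, expected in enumerate(expected_hashes)
        if index >= len(actual) or actual[index] != expected
    ]
```

Only the indices returned by `bad_pieces` would need to be fetched again, which is what makes an interrupted or partially corrupted multi-gigabyte transfer cheap to repair.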

Update: See Michelle's comment below. You should now be able to do

sh <(curl -s http://demo.asperasoft.com/ascp-install-3.5.4.102989-linux-64.sh)


2

All - Since multiple people commented it was hard to find the Aspera Linux client and unix-appropriate docs, we posted a self-extracting installer for our 'ascp' Linux command line binary, with man page.

http://demo.asperasoft.com/ascp-install-3.5.4.102989-linux-64.sh

Extract the contents, and run man ascp for all details of usage.

We will find a more permanent home on our web site (www.asperasoft.com) soon. Hopefully this is helpful, and if you have any questions or feedback, feel free to write to us at support@asperasoft.com.

Thank you,
Michelle

1

We have also added a permanent home for the ascp installer on our web site:

man ascp gives all usage.

If any other platforms (OS X, Win, Solaris, etc.) are needed please let us know. We support them but don't get as many requests for the standalone CLI.

Michelle

0

That was my experience, too, Matt. Sadly, the main reason we're moving away from the existing solution is its terrible error reporting: we discover far too late, around release time, that large jobs have failed in subtle ways, which causes delays. If people don't need this, then I guess Aspera is probably the way to go (if you can afford it).

0

BitTorrent isn't something I'd considered. I'll check that out. Assuming you can create a partitioned network away from the Wild West, I don't think there should be any issues with the main network's nefarious uses. Might have to warn our networks guys, though, or it would scare them to death, :-).

1

I just wrote a post benchmarking BitTorrent vs scp. I'm not going to benchmark against Aspera, since I don't have a server license, but I think that in terms of throughput the ranking would be Aspera, Unison, UDT > BitTorrent > scp, netcat, HTTP, FTP. The main benefits of using BitTorrent would be lightweight infrastructure and good, stable tools, as well as scalable distribution if you are sending data to more than one collaborator.

1

Another benefit of BitTorrent is that the data sources can be distributed across multiple locations. In realistic scenarios, download speeds are often capped at the source, which is beyond one's control. Simultaneous downloads from multiple sources are often substantially faster.

0

I would include bbcp in there as well. We have had some pretty good performance from it and it is almost a drop-in replacement for scp.

0

That's very interesting. Does bbcp have to be installed on both the source and sink client, or is it a "drop in replacement" in the sense that it only needs an ssh server on the receiving end?

0

It needs to be installed on both ends, but "installed" here just means that the executable needs to be in the user's path. For some useful details, see:

http://pcbunn.cacr.caltech.edu/bbcp/using_bbcp.htm

0

Have you had a look at BitTorrent Sync? (http://labs.bittorrent.com/experiments/sync.html) One-way or two-way secure encrypted synchronisation. I've never tried it but really like the concept.

0

Yes, although not in the context of rsync-style folder synchronization on a server. It seems promising.

3
9.7 years ago

We use GlobusOnline for this. It is cheaper than Aspera, has a command line interface, works great with Linux, and is very secure. It is based on GridFTP and handles big files, small files, and in-between files.

0

That looks very promising. How easy are things like GridFTP to set up in your experience?

1

Globus is simple to set up. You need to make an account and install a straightforward tool at each endpoint. Directions and a tutorial are on their website. There is excellent user support if you have more detailed questions.

0

Thanks, Alex. I'll give that a go, along with maybe a couple of the other suggestions here and report back on how I get on. The Globus Connect thing looks like it gets round the usual firewall worries.

(In the past when I've tried things called "Grid" it's meant massive, flaky java apps and expensive and weird additional infrastructure, hence my caution, but this definitely looks like an exception to that).

0

Just a note that at least one endpoint needs to be a full-fledged GridFTP server. Globus Connect, the simple installer, does not work for both endpoints of a transfer.

2

We now have Globus Connect-to-Globus Connect transfers working. See https://www.globus.org.

0

Is this possible without Globus Plus? And is it possible to get Globus Plus without a Provider license? I've been trying to figure this out for a few days but even after contacting support, it's been hard to figure out what the best path is. No issues with paying a small fee for a license, but the provider plan would be mega-overkill. The firewall requirements for the full Globus server are pretty severe (in terms of # of ports), and Aspera Point-to-Point was surprisingly expensive when I got a quote recently. I only need to transfer a few hundred GB per month, but of course they're large gzipped files, not the sort of thing I want to rely on standard FTP or Dropbox for.

1

I'd try having a direct conversation with Globus support about your exact requirements, if you haven't already done so. There may have been some changes since this answer was posted ~2 years ago. Globus ought to be ideal for the kind of transfers you mention. Usually they will work directly with your network admins to manage port/firewall issues. That's a common hurdle across institutions -- maybe best to just get everyone a call together?

0

Globus Connect Personal is just fine for this kind of thing, is free, and does not have the same requirements for open firewall ports (it has none). If you want to use the "sharing" feature, then you'll need a "Plus" account for about $7/month. I agree with Alex that having a direct conversation with the Globus folks is never a bad idea, though.

0

Thanks for sharing this. It looks very slick.

2
9.7 years ago

File synchronization is a far more complex problem than one would anticipate. For example, checking permissions and ownership is not nearly as simple as it sounds: there are security implications plus other limitations.

As far as I know, Unison is by far the most sophisticated because it offers replication in both directions (a far more complicated task), whereas most of the other tools that you list are just transfer tools.

0

Unison is slightly more painful than the other options to install, so I've been going by the manual on its features. One thing I couldn't tell was how efficiently it handles the transfer process, either directly or by delegating it to a third party. We get substantially (i.e. orders of magnitude) better transfer rates using either an FDT-type multi-TCP system or a UDR/UDT-style UDP system than we do from rsync+ssh. The bit of the Unison manual which talks about this seems to suggest we'd only get the latter's performance? One of the reasons I like the look of Unison is that we have some other, unrelated multiple-synced filesystem annoyances which Unison could potentially also solve.
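For readers unfamiliar with why multi-stream tools beat a single rsync+ssh stream: the general idea is to split a file into byte ranges and move several ranges at once. The sketch below only simulates this with a local file-to-file copy using threads; it illustrates the range splitting and reassembly, not the wire protocol of FDT, bbcp or any other tool, and all function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Divide [0, size) into up to `streams` contiguous (offset, length) ranges."""
    if size == 0:
        return []
    base, extra = divmod(size, streams)
    ranges, offset = [], 0
    for index in range(streams):
        length = base + (1 if index < extra else 0)
        if length:
            ranges.append((offset, length))
            offset += length
    return ranges


def copy_range(src_path, dst_path, offset, length, block=1 << 20):
    """Copy one byte range; each worker opens its own file handles."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        remaining = length
        while remaining:
            chunk = src.read(min(block, remaining))
            if not chunk:
                raise IOError("source shorter than expected")
            dst.write(chunk)
            remaining -= len(chunk)


def parallel_copy(src_path, dst_path, streams=4):
    """Preallocate the destination, then fill its ranges concurrently."""
    size = os.path.getsize(src_path)
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [
            pool.submit(copy_range, src_path, dst_path, offset, length)
            for offset, length in split_ranges(size, streams)
        ]
        for future in futures:
            future.result()  # re-raise worker errors instead of failing silently
```

Over a real network each range would travel on its own connection, so one stalled or window-limited stream does not hold back the rest; the final checksum comparison is still needed to confirm that the reassembled file is intact.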

While full filesystem security checking is painful, all the mistakes we've had so far have been wrong user/group/permission settings, which need only stat(2)/chown(2)/chmod(2)-level checks of the kind that rsync handles nicely. Of course, once we stop these happening we'll start seeing the more exotic issues, :-).
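A stat(2)-level check of this kind is easy to script on the receiving side. The following is a minimal sketch, assuming a POSIX system and a single expected mode/owner/group for the whole tree; `describe` and `audit_tree` are hypothetical helpers, and a real release check would probably need per-directory policies.

```python
import os
import stat


def describe(path):
    """Return (permission bits, uid, gid) from lstat(2), without following symlinks."""
    info = os.lstat(path)
    return stat.S_IMODE(info.st_mode), info.st_uid, info.st_gid


def audit_tree(root, expected_mode, expected_uid=None, expected_gid=None):
    """List (relative path, problem) pairs for files that break the expected policy."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            mode, uid, gid = describe(full)
            rel = os.path.relpath(full, root)
            if mode != expected_mode:
                problems.append((rel, "mode %o, expected %o" % (mode, expected_mode)))
            if expected_uid is not None and uid != expected_uid:
                problems.append((rel, "uid %d, expected %d" % (uid, expected_uid)))
            if expected_gid is not None and gid != expected_gid:
                problems.append((rel, "gid %d, expected %d" % (gid, expected_gid)))
    return problems
```

Running `audit_tree` right after each transfer and failing the job on a non-empty result surfaces wrong ownership or permissions immediately, rather than at release time.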

1
9.7 years ago
Gabriel R. ★ 2.9k

There is another possibility. If your collaborators are only interested in a specific chunk, just put the files on an FTP server and index them with samtools index or tabix; your collaborators will then be able to query them remotely. If they need the whole thing, then this approach is no good.

0

(for completeness, I should say that we have a site-wide Aspera licence, so the price isn't an issue for us, but that's irrelevant for other readers of this question).

1
9.7 years ago

From what I understand, Aspera is good but it has a very high price tag. Data Expedition is not quite as slick but it performs similarly in many respects.

1
9.7 years ago
Seth Noble ▴ 10

Apologies for posting about my own company, but it does seem relevant to the OP.

My company, Data Expedition, Inc. produces commercial software that covers most, if not all, of the above requirements. As Jeremy implied, our utilities are more technically focused and much lower priced than other offerings.

Our ExpeDat and SyncDat software use proprietary UDP data transport to "fill the pipe". Metadata such as Unicode file names, date stamps, and Unix permissions are preserved. Security and integrity are guaranteed. The ExpeDat command line client, "movedat", has many features to enable scripting and embedded operation. It has a "Streaming Folders" mode that lets you transfer billions of tiny files in a single stream. The SyncDat product does directory comparison and only transfers changed files, similar in some respects to rsync. Free trials of both are available on our website (no need to talk to anyone or get "approved" just to try it).
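As a generic illustration of "directory comparison, only transfer changed files": rsync's default quick check treats a file as unchanged when its size and modification time match on both sides. The sketch below applies that rule to two local directory trees. It is not how SyncDat or rsync is implemented, and `snapshot` and `files_to_send` are hypothetical names.

```python
import os


def snapshot(root):
    """Map relative path -> (size, whole-second mtime) for every file under root."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            entries[os.path.relpath(full, root)] = (info.st_size, int(info.st_mtime))
    return entries


def files_to_send(source_root, dest_root):
    """Paths that are new at the source or whose size/mtime differ at the destination."""
    source, dest = snapshot(source_root), snapshot(dest_root)
    return sorted(path for path, meta in source.items() if dest.get(path) != meta)
```

The size/mtime rule is fast because it never reads file contents, at the cost of missing a change that preserves both values; a checksum pass closes that gap when it matters.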

I'm happy to answer any questions here or offline, and we have full documentation and technical notes publicly available online as well.

1
9.7 years ago
always_learning ★ 1.1k

Hi all.

These are a few file transfer tools which have been used for copying files in NGS analysis.

Let us know your thoughts on this.

Thanks, Syed