Question: Large File Transfers Of Ngs Data: Rsync / Bbcp / Unison / What?
11
4.4 years ago by
Dan Sheppard110
WTSI, Hinxton, UK
Dan Sheppard110 wrote:

I've read a few answers concerning the transfer of large data quickly and reliably across the internet, but I've not been able to find a single tool which combines all of the features below. Does anyone know if a combination of options in one of the commonly used tools achieves this?

  • Multiple TCP streams or UDP for very fast transfer of bulk data
  • Similarly sensible re disk writes, threads, poll/select and copying to get stuff onto disk quickly
  • Checking of ownership and permissions as well as checksumming at both ends
  • Handles multiple small files as efficiently as very large files
  • Can run "rsync-style", only transferring diffs when appropriate
  • Has good security: integrity and authenticity guarantees (secrecy not required)
  • Good-quality Linux server and client with robust error detection and reporting

I've had a look at rsync, fdt, bbcp, unison, aspera, udt/udr, &c, and each seems to offer only a subset of these features.

Obviously, through a combination of tools and a load of glue scripts I could achieve this with existing tools, but before going to the effort, if it's just a magic combination of parameters, do let me know!

data • 12k views
ADD COMMENTlink modified 4.4 years ago by always_learning800 • written 4.4 years ago by Dan Sheppard110
4
4.4 years ago by
Matt Shirley8.0k
Cambridge, MA
Matt Shirley8.0k wrote:

As much as I hate to admit it, Aspera performs all of these functions except:

good quality linux server and client with robust error detection and reporting

The Linux client is hard to find (see update), and the command-line client provides only a basic level of documentation. What about the BitTorrent protocol? I haven't seen anyone using it for NGS data transfers, but regardless of the ethics surrounding its popularity, it's actually a great data transfer protocol: it supports private, encrypted transfers and UDP transfers, scales with file size and number of connections, performs strong checksumming and block-based partial transfers, and has several wonderful Linux clients. The only thing missing from your list would be checking ownership and permissions at each end.
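To illustrate the block-based checksumming point: BitTorrent v1 stores one SHA-1 digest per fixed-size piece, so a receiver can tell exactly which pieces are damaged or missing and re-fetch only those. The Python sketch below shows just that idea on local files. It is not a BitTorrent client, and the piece size and function names are assumptions made for the example.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # assumed piece size; real torrents choose their own


def piece_hashes(path, piece_size=PIECE_SIZE):
    """Return one SHA-1 hex digest per fixed-size piece of the file."""
    digests = []
    with open(path, "rb") as handle:
        while True:
            piece = handle.read(piece_size)
            if not piece:
                break
            digests.append(hashlib.sha1(piece).hexdigest())
    return digests


def pieces_to_resend(expected, actual):
    """Indices of pieces that are missing or whose digest differs."""
    return [
        index
        for index, digest in enumerate(expected)
        if index >= len(actual) or actual[index] != digest
    ]
```

Comparing the sender's piece list with the receiver's gives the minimal set of pieces to transfer again, which is what makes interrupted transfers cheap to resume.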

Update: See Michelle's comment below. You should now be able to do sh <(curl -s http://demo.asperasoft.com/ascp-install-3.5.4.102989-linux-64.sh)

It looks like they also have a separate client download page now:  http://downloads.asperasoft.com/en/downloads/50

ADD COMMENTlink modified 2.7 years ago • written 4.4 years ago by Matt Shirley8.0k
2

All - Since multiple people commented it was hard to find the Aspera Linux client and unix-appropriate docs, we posted a self-extracting installer for our 'ascp' Linux command line binary, with man page.

http://demo.asperasoft.com/ascp-install-3.5.4.102989-linux-64.sh

Extract the contents, and run 'man ascp' for all details of usage.

We will find a more permanent home on our web site (www.asperasoft.com) soon. Hopefully this is helpful. If you have any questions or feedback, feel free to write to us at support@asperasoft.com.

Thank you,

Michelle

 

ADD REPLYlink written 2.7 years ago by michelle30
1

We have also added a permanent home for the ascp installer on our web site:

Current Release : http://download.asperasoft.com/download/sw/ascp-client/3.5.4/ascp-install-3.5.4.102989-linux-64.sh
General Download Page: http://downloads.asperasoft.com/en/downloads/50

# man ascp 

gives all usage.

If any other platforms (OS X, Win, Solaris, etc.) are needed please let us know. We support them but don't get as many requests for the standalone CLI.

Michelle

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by michelle30

That was my experience, too, Matt. Sadly, the main reason we're moving from the existing solution is its terrible error reporting, which means we discover far too late that large jobs have failed in subtle ways around release time, causing delays. If people don't need this, then I guess Aspera is probably the way to go (if you can afford it).
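One tool-independent way to catch subtle failures early is to verify every file against a checksum manifest as soon as a transfer finishes. The Python sketch below is a minimal illustration under assumptions: the manifest is an in-memory dict of relative path to SHA-256 digest, and the function names are invented for this example rather than taken from any of the tools discussed.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root to its SHA-256 digest."""
    manifest = {}
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(directory, name)
            manifest[os.path.relpath(full_path, root)] = sha256_of(full_path)
    return manifest


def verify(manifest, root):
    """Return a list of human-readable problems; an empty list means success."""
    problems = []
    for relative_path, expected in sorted(manifest.items()):
        full_path = os.path.join(root, relative_path)
        if not os.path.isfile(full_path):
            problems.append(f"MISSING  {relative_path}")
        elif sha256_of(full_path) != expected:
            problems.append(f"CORRUPT  {relative_path}")
    return problems
```

Building the manifest at the source and running `verify` at the destination reports every missing or corrupt file in one pass, so a job can fail loudly straight after the transfer rather than at release time.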

ADD REPLYlink written 4.4 years ago by Dan Sheppard110

BitTorrent isn't something I'd considered. I'll check that out. Assuming you can create a partitioned network away from the Wild West, I don't think there should be any issues with the main network's nefarious uses. Might have to warn our networks guys, though, or it would scare them to death, :-).

ADD REPLYlink written 4.4 years ago by Dan Sheppard110
1

I just wrote a post benchmarking BitTorrent vs scp. I'm not going to benchmark against Aspera, since I don't have a server license, but in terms of throughput I think the ranking would be aspera, unison, udt > BitTorrent > scp, netcat, http, ftp. The main benefits of using BitTorrent would be lightweight infrastructure and good, stable tools, as well as scalable distribution if you are sending data to more than one collaborator.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Matt Shirley8.0k
1

Another benefit of BitTorrent is that the data sources can be distributed across multiple locations. In realistic scenarios, download speeds are often capped at the source, beyond one's control. Simultaneous downloads from multiple sources are often substantially faster.

ADD REPLYlink written 4.4 years ago by Istvan Albert ♦♦ 74k

I would include bbcp in there as well. We have had some pretty good performance from it and it is almost a drop-in replacement for scp.

ADD REPLYlink written 4.4 years ago by Sean Davis23k

That's very interesting. Does bbcp have to be installed on both the source and sink client, or is it a "drop in replacement" in the sense that it only needs an ssh server on the receiving end?

ADD REPLYlink written 4.4 years ago by Matt Shirley8.0k

It needs to be installed on both ends, but "installed" here just means that the executable needs to be in the user's PATH. For some useful details, see:

http://pcbunn.cacr.caltech.edu/bbcp/using_bbcp.htm

ADD REPLYlink written 4.4 years ago by Sean Davis23k

Have you had a look at BitTorrent Sync? (http://labs.bittorrent.com/experiments/sync.html) One-way or two-way secure encrypted synchronisation. I've never tried it but really like the concept.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Daniel3.5k

Yes, although not in the context of rsync-style folder synchronization on a server. It seems promising.

ADD REPLYlink written 4.4 years ago by Matt Shirley8.0k
3
4.4 years ago by
Rochester, NY USA
Alex Paciorkowski3.3k wrote:

We use GlobusOnline for this. It is cheaper than Aspera, has a command-line interface, works great with Linux, and is very secure. It is based on GridFTP and handles big files, small files, and in-between files.

ADD COMMENTlink written 4.4 years ago by Alex Paciorkowski3.3k

That looks very promising. How easy are things like GridFTP to set up in your experience?

ADD REPLYlink written 4.4 years ago by Dan Sheppard110
1

Globus is simple to set up. You need to create an account and install a straightforward tool at each endpoint. Directions and a tutorial are on their website. User support is excellent if you have more detailed questions.

ADD REPLYlink written 4.4 years ago by Alex Paciorkowski3.3k

Thanks, Alex. I'll give that a go, along with maybe a couple of the other suggestions here and report back on how I get on. The Globus Connect thing looks like it gets round the usual firewall worries.

(In the past when I've tried things called "Grid" it's meant massive, flaky java apps and expensive and weird additional infrastructure, hence my caution, but this definitely looks like an exception to that).

ADD REPLYlink written 4.4 years ago by Dan Sheppard110

Just a note that at least one endpoint needs to be a full-fledged GridFTP server. Globus Connect, the simple installer, does not work for both endpoints of a transfer.

ADD REPLYlink written 4.4 years ago by Sean Davis23k
2

We now have Globus Connect-to-Globus Connect transfers working. See www.globus.org. 

ADD REPLYlink written 3.4 years ago by ianfost20

Is this possible without Globus Plus? And is it possible to get Globus Plus without a Provider license? I've been trying to figure this out for a few days but even after contacting support, it's been hard to figure out what the best path is. No issues with paying a small fee for a license, but the provider plan would be mega-overkill. The firewall requirements for the full Globus server are pretty severe (in terms of # of ports), and Aspera Point-to-Point was surprisingly expensive when I got a quote recently. I only need to transfer a few hundred GB per month, but of course they're large gzipped files, not the sort of thing I want to rely on standard FTP or Dropbox for.

ADD REPLYlink written 2.4 years ago by Adamc530
1

I'd try having a direct conversation with Globus support about your exact requirements, if you haven't already done so. There may have been some changes since this answer was posted ~2 years ago. Globus ought to be ideal for the kind of transfers you mention. Usually they will work directly with your network admins to manage port/firewall issues. That's a common hurdle across institutions, so maybe it's best to just get everyone on a call together?
 

ADD REPLYlink written 2.4 years ago by Alex Paciorkowski3.3k

Globus Connect Personal is just fine for this kind of thing, is free, and does not have the same requirements for open firewall ports (it has none). If you want to use the "sharing" feature, then you'll need a "Plus" account for about $7/month. I agree with Alex that having a direct conversation with the Globus folks is never a bad idea, though.

ADD REPLYlink written 2.4 years ago by Sean Davis23k

Thanks for sharing this. It looks very slick.

ADD REPLYlink written 4.4 years ago by Matt Shirley8.0k
2
4.4 years ago by
Istvan Albert ♦♦ 74k
University Park, USA
Istvan Albert ♦♦ 74k wrote:

File synchronization is a far more complex problem than one would anticipate. For example, checking permissions and ownership is not nearly as simple as it sounds: there are security implications plus other limitations.

As far as I know, Unison is by far the most sophisticated because it offers replication in both directions (a far more complicated task), whereas most of the other tools that you list are just transfer tools.

ADD COMMENTlink written 4.4 years ago by Istvan Albert ♦♦ 74k

Unison is slightly more painful to install than the other options, so I've been going by the manual for its features. One thing I couldn't tell was how efficiently it handles the transfer process, either directly or by delegating it to a third party. We get substantially (i.e. orders of magnitude) better transfer rates using either an FDT-type multi-TCP system or a UDR/UDT-style UDP system than we do from rsync+ssh. The bit of the Unison manual which talks about this seems to suggest we'd only get the latter's performance? One of the reasons I like the look of Unison is that we have some other, unrelated multiple-synced-filesystem annoyances which it could potentially also solve.
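For readers unfamiliar with the multi-stream idea: the file is split into byte ranges and each range is moved by its own worker, so no single stream limits the total rate. The Python sketch below demonstrates only the range splitting and out-of-order writes, using a local copy in place of network sockets. It makes no claim about how FDT or bbcp are implemented, all names are invented for the example, and it assumes a Unix-like system where `os.pread` and `os.pwrite` are available.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size, streams):
    """Split [0, total_size) into at most `streams` (offset, length) ranges."""
    if total_size == 0:
        return []
    chunk = -(-total_size // streams)  # ceiling division
    return [
        (offset, min(chunk, total_size - offset))
        for offset in range(0, total_size, chunk)
    ]


def copy_range(src_fd, dst_fd, offset, length, block=1024 * 1024):
    """Copy one byte range with positional I/O so threads can share descriptors."""
    end = offset + length
    while offset < end:
        data = os.pread(src_fd, min(block, end - offset), offset)
        if not data:
            raise IOError(f"unexpected end of file at offset {offset}")
        written = 0
        while written < len(data):
            written += os.pwrite(dst_fd, data[written:], offset + written)
        offset += len(data)


def parallel_copy(src_path, dst_path, streams=4):
    """Copy src_path to dst_path using one worker thread per byte range."""
    size = os.path.getsize(src_path)
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(dst_fd, size)  # pre-size so ranges can land in any order
        with ThreadPoolExecutor(max_workers=streams) as pool:
            futures = [
                pool.submit(copy_range, src_fd, dst_fd, offset, length)
                for offset, length in byte_ranges(size, streams)
            ]
            for future in futures:
                future.result()  # re-raise any worker error instead of hiding it
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

In a real transfer tool each range would travel over its own TCP connection, and the receiver would still need an end-to-end checksum once all ranges have arrived.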

While full filesystem security checking is painful, all the mistakes we've had so far have been wrong user/group/permissions, which can be caught with just the stat(2)/chown(2)/chmod(2) kind of checking that rsync handles nicely. Of course, once we stop these happening we'll start seeing the more exotic issues, :-).
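The stat(2)-level check described here can be sketched in a few lines of Python. This is a minimal illustration with invented function names. It compares only the numeric uid, gid and permission bits, and it ignores ACLs, extended attributes and the mapping of user names between hosts.

```python
import os
import stat


def metadata(path):
    """Owner, group, and permission bits for one path (symlinks not followed)."""
    info = os.lstat(path)
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        "mode": stat.S_IMODE(info.st_mode),
    }


def metadata_mismatches(expected, path):
    """Describe each field of `path` that differs from the expected metadata."""
    actual = metadata(path)
    problems = []
    for field in ("uid", "gid", "mode"):
        if actual[field] != expected[field]:
            shown = oct if field == "mode" else str  # show modes in octal
            problems.append(
                f"{path}: {field} is {shown(actual[field])}, "
                f"expected {shown(expected[field])}"
            )
    return problems
```

Recording `metadata(path)` for each file at the source and calling `metadata_mismatches` at the destination flags wrong owners, groups or modes before anyone downstream trips over them.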

ADD REPLYlink written 4.4 years ago by Dan Sheppard110
1
4.4 years ago by
Gabriel R.2.1k
Center for Geogenetik Københavns Universitet
Gabriel R.2.1k wrote:

There is another possibility. If your collaborators are only interested in specific regions, just put the files on an FTP server and index them with samtools index or tabix; your collaborators will then be able to query those regions remotely without downloading the whole file. If they need the whole thing, then this approach is no good.
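As a rough sketch of what a remote query looks like from the collaborator's side, the Python helper below builds a `samtools view` or `tabix` command for a URL and region and runs it. It assumes samtools and tabix are installed and that the index file sits next to the data file on the server. The function names are invented for this example, and any URL passed in is the caller's own.

```python
import subprocess


def region_query_command(url, region):
    """Pick samtools or tabix from the file extension and build the argv list."""
    if url.endswith(".bam"):
        return ["samtools", "view", url, region]
    if url.endswith(".gz"):
        return ["tabix", url, region]
    raise ValueError(f"don't know how to query {url}")


def fetch_region(url, region):
    """Run the query and return the matching records as text."""
    result = subprocess.run(
        region_query_command(url, region),
        check=True,           # raise if the tool exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Only the index and the blocks overlapping the requested region are fetched, which is why this works well for small slices and poorly when the whole file is needed.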

ADD COMMENTlink written 4.4 years ago by Gabriel R.2.1k

(for completeness, I should say that we have a site-wide Aspera licence, so the price isn't an issue for us, but that's irrelevant for other readers of this question).

ADD REPLYlink written 4.4 years ago by Dan Sheppard110
1
4.4 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

From what I understand, Aspera is good but has a very high price tag. Data Expedition is not quite as slick, but it performs similarly in many respects.

ADD COMMENTlink written 4.4 years ago by Jeremy Leipzig17k
1
4.4 years ago by
Seth Noble10
Seth Noble10 wrote:

Apologies for posting about my own company, but it does seem relevant to the OP.

My company, Data Expedition, Inc. produces commercial software that covers most, if not all, of the above requirements. As Jeremy implied, our utilities are more technically focused and much lower priced than other offerings.

Our ExpeDat and SyncDat software use a proprietary UDP data transport to "fill the pipe". Metadata such as Unicode file names, date stamps, and Unix permissions is preserved. Security and integrity are guaranteed. The ExpeDat command-line client, "movedat", has many features to enable scripting and embedded operation. It has a "Streaming Folders" mode that lets you transfer billions of tiny files in a single stream. The SyncDat product does directory comparison and only transfers changed files, similar in some respects to rsync. Free trials of both are available on our website (no need to talk to anyone or get "approved" just to try it).

I'm happy to answer any questions here or offline, and we have full documentation and technical notes publicly available online as well.

ADD COMMENTlink written 4.4 years ago by Seth Noble10
1
4.4 years ago by
Doha, Qatar
always_learning800 wrote:

Hi all.

Here are a few file transfer tools that have been used for copying files in NGS analysis.

  1. UDT: http://udt.sourceforge.net/software.html
  2. FDT: http://monalisa.cern.ch/FDT/
  3. Tsunami: http://tsunami-udp.sourceforge.net/

Let me know your thoughts on these.

Thanks, Syed

ADD COMMENTlink modified 4.4 years ago by Istvan Albert ♦♦ 74k • written 4.4 years ago by always_learning800