Huge NGS Data Storage And Transferring
8
9
11.3 years ago
Himwo ▴ 90

Dear Biostar member,

I am a newcomer to NGS data analysis, and my research team is exploring the hardware configuration for it. As far as I know, the raw data files are in fastq format and are huge (several gigabytes each). I wonder what the best strategy is for storing and transferring the data for analysis. I guess that if I transfer the data through LAN or USB, it would still take a day to transfer 1 sample. That does not seem very efficient; could you please advise?

Besides, I wonder whether the software normally used for NGS data analysis supports grid computing?

Many Thanks!

Him

next-gen sequencing fastq • 13k views
11
11.3 years ago

As part of your strategy, my advice would be not to store the sequence/quality data in Fastq. At least consider using BAM files: they contain a superset of the data found in Fastq, are compressed, and their headers may be used to store metadata useful for tracking.

I would expect to see software that currently uses Fastq to start using BAM directly. BWA already does this and will align directly from name-sorted BAM files without going via Fastq.

4

this is in fact a very useful suggestion. since BAM files can be converted back to Fastq (using SamToFastq for instance), keeping optimized and non-redundant files should be mandatory. our experience tells us that doing so reduces storage needs to one third of the raw csfasta+qual files we get from our SOLiD machine.

1

This is only true for Fastq; SFF files from 454, for example, contain additional data. Also, many tools do not understand BAM directly, but need Fasta or Fastq.

0

YMMV. It works well if you're using HiSeqs; both the WTSI and Broad use this approach and dynamically convert to Fastq for those apps that need it.

8
11.3 years ago
Darked89 4.2k

This is new, and therefore potentially risky and unsupported by downstream applications, but for fastq data compression it may be the thing we need:

“Compression of genomic sequences in FASTQ format”

Program web site: http://sun.aei.polsl.pl/dsrc

EDIT

Compression size comparisons:

1641617178 my_data.fastq
504780159 my_data.fastq.gz
458491510 my_data.fastq.bam
400375082 my_data.fastq.bz2
364381700 my_data.fastq.xz
361403728 my_data.fastq.dsrc
326326261 my_data.fastq.bsc
302321958 my_data.fastq.dsrc_l
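
To make the listing above easier to compare, here is a minimal Python sketch that turns the byte counts into ratios relative to the uncompressed file. The sizes are copied from the listing; the function and variable names are made up for illustration.

```python
# Sizes in bytes, copied from the listing above.
ORIGINAL = 1641617178
COMPRESSED = {
    "gz": 504780159,
    "bam": 458491510,
    "bz2": 400375082,
    "xz": 364381700,
    "dsrc": 361403728,
    "bsc": 326326261,
    "dsrc_l": 302321958,
}


def compression_ratios(original, compressed):
    """Return {name: compressed_size / original_size}; lower is better."""
    return {name: size / original for name, size in compressed.items()}


ratios = compression_ratios(ORIGINAL, COMPRESSED)
# Print the formats from smallest to largest output.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>7}: {ratio:.1%} of the original size")
```

Size is only one axis; compression and decompression times are not captured by this comparison.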


Best option (probably faster than bsc, but not strictly tested):

dsrc_gcc e -l my_data.fastq my_data.fastq.dsrc_l


xz took ages to compress the same file; dsrc, even with -l, was very fast. Since dsrc is not widely used, it is untested in the real world AFAIK. Still, with the extra step of md5 checking it should be OK to use it, e.g. for file transfers.
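
The md5 check mentioned above could look roughly like this in Python. It is only a sketch using the standard `hashlib` module; the function names and the 1 MiB chunk size are my own choices, not part of dsrc or any transfer tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, copied_path):
    """True if both files have the same MD5 digest."""
    return md5_of_file(source_path) == md5_of_file(copied_path)
```

For a compressor you do not fully trust, the same idea applies to a round trip: hash the original, compress, decompress to a new file, and compare the digests before deleting anything.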

EDIT 2

Plus fastq-to-BAM conversion using Picard. Command (not sure if optimal):

java -Xms2048m -jar FastqToSam.jar \
FASTQ=my_data.fastq \
QUALITY_FORMAT=Standard \
OUTPUT=my_data.fastq.bam \
SAMPLE_NAME=SampName \
PLATFORM=Illumina

3

The key advantage of gzip is its decompression speed. A BAM file is compressed a few times but decompressed many times, so decompression speed is what matters. Another, minor advantage of gzip is that it is widely available.
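
As a small illustration of the "widely available" point, gzip-compressed FASTQ can be streamed with nothing but the Python standard library. This is a sketch that assumes the common four-lines-per-record FASTQ layout; the function name is made up.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file.

    Assumes exactly four lines per record (no wrapped sequences).
    The file is decompressed on the fly and never fully loaded.
    """
    lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            lines += 1
    return lines // 4
```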

0

If you are concerned with decompression speed and backward compatibility with tools that expect FASTQ input, take a look at a compression tool I wrote called SeqDB. It takes an orthogonal approach to DSRC, and prioritizes speed and compatibility over compression ratio.

0

sure, but I would think of gzip or bzip, can they beat conventional compression?

0

Nice compression rate... Have you compared the dsrc compression result with the corresponding unaligned BAM file? As Keith said above, it is very convenient to have a compressed file that can also be used directly by the tools.

0

Also +1. Dsrc is interesting. I overlooked this paper.

0

This is my benchmark:

5468703216 Pool_21670_S1_L1_P1.fastq
2059443423 Pool_21670_S1_L1_P1.fastq.gz
1421754084 Pool_21670_S1_L1_P1.fastq.bz2
1326707847 Pool_21670_S1_L1_P1.dsrc
1292721319 Pool_21670_S1_L1_P1.dsrc.best
1125420602 Pool_21670_S1_L1_P1.fqz
1069023188 Pool_21670_S1_L1_P1.fqz.best
581800104 Pool_21670_S1_L1_P1.fastqz.fxq.zpaq
450469344 Pool_21670_S1_L1_P1.fastqz.fxb.zpaq
50461346 Pool_21670_S1_L1_P1.fastqz.fxh.zpaq


NOTE

.fqz means fqzcomp

.fastqz.fx[qbh].zpaq are three output files from fastqz

.best means trading compression speed for a higher compression ratio.

My conclusion

Although dsrc compresses very well, it could not decompress my file; the error message was something like std::alloc() failed.

I then picked fqzcomp to compress my fastq files.

7
11.3 years ago

NGS technologies have strong hardware implications at several steps of the entire process, and these have to be considered as a whole when building everything from scratch. usually a group tends to reuse already available machinery and networking, so optimizing the existing resources and appropriately evaluating future needs is mandatory to end up building the best NGS system for you.

you will have to consider that the sequencer itself will write the raw data onto the computer (or little cluster, like the one SOLiD gives you attached to the sequencer) that actually controls it. these raw data are usually processed on a different computer, since the original one is focused on controlling the sequencing process and handling the raw data, so if you want to keep your sequencer up and running as much as possible you wouldn't want to overload its controlling computer with mapping or any other secondary/tertiary analysis.

the first issue that arises here is moving the data out from that machine to another place where you would store and analyze it. data sizes are definitely an issue, so the connection between the sequencer computer and the analysis machine should be as fast as possible. a gigabit connection, as mentioned here, would be advisable, although if you aren't able to upgrade your network or if the line to your analysis machine goes through paths which you may not be able to control, you will have to calculate transfer times considering that you will typically have to move a few hundred GB out of the sequencer.

when you come to store data, you will also have to decide what to store and what to leave behind. for instance, it was hard for us to decide to give up the raw images, but when we calculated the storage costs of those images we saw that it was cheaper to repeat the experiment than to store the images for a long time. take into account that if you don't store the raw images (typically a few TB in size) you save not only storage capacity, but also data transfer time.

once you have solved those basic issues (until now you only have unprocessed raw data) you will have to start thinking about mapping that data, which is typically a very resource-demanding step. since mapping is a very parallelizable task, mappers usually support multiple cores at least, and some of them can in fact be installed on supercomputers where job queues may be used. you will then have to decide which program or programs you want to use, and then think about the machine they will demand. again, typically you will end up requiring a little (or large, depending on your needs) multinode computer, where you should be able to perform the mapping step. from then on you will probably use the same cluster to perform further downstream analysis (e.g. variant calling).

as an example case I will give you some numbers we humbly deal with at our lab. as I've mentioned, we have a SOLiD machine that came with a "little cluster" attached, made of 1 head node and 3 computing nodes (8 cores and 16GB of memory each), with a shared storage of 10TB (online cluster). we are connected through a gigabit line to a local supercomputing center, so we built there a customized cluster made of 1 head node and 5 computing nodes (8 cores and 24GB of memory each), with a shared storage of 5TB (offline cluster). our standard workflow generates a few hundred GB in .csfasta and .qual files, which we move from the online to the offline cluster in a couple of hours, and then we start mapping and calling SNPs and small indels. this generates a few GB of results in BAM files and variant lists, which we access differently: we leave the BAM files on the remote cluster for archiving and visualization purposes only (launching IGV locally and pointing to the remote BAM files works perfectly), and we do the main research effort using the variants, which represent a few MB only.
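
to illustrate why mapping parallelizes so easily: each read is mapped independently, so the input can be cut into chunks and each chunk handed to a separate core or cluster job. below is a minimal Python sketch of that splitting step only. it assumes four-line FASTQ records, the function names are invented for this example, and the mapper and job submission themselves are left out.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples from an
    iterable of text lines, assuming four lines per record."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield record


def chunk_records(records, chunk_size):
    """Group records into lists of at most chunk_size, one list per job."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk
```

each chunk could then be written to its own file and submitted to the queue, with the per-chunk results merged afterwards.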

5
11.3 years ago

Supposing you have a decent (Gigabit) LAN, transfers up to even terabyte levels should be no problem. I just tested mine on the Maastricht University network, going outside to speedtest.net. Download speed is 165 Mb/s and upload speed 70 Mb/s. So about 100 Mb/s on average, meaning I could transfer 10 GB of data in less than 20 minutes even externally. (Sorry, had to correct this, didn't have coffee yet: the measurements were in bits (b), not bytes (B), per second.) USB is not a good idea.
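
The bits-versus-bytes arithmetic is easy to get wrong, so here is the calculation as a small Python sketch. It uses decimal units (1 GB = 10^9 bytes, 1 Mb = 10^6 bits) and ignores protocol overhead, so real transfers will be somewhat slower; the function name is just for illustration.

```python
def transfer_seconds(size_bytes, rate_bits_per_second):
    """Ideal transfer time: convert bytes to bits, divide by the line rate."""
    return size_bytes * 8 / rate_bits_per_second


GB = 10**9    # bytes
MBIT = 10**6  # bits per second

seconds = transfer_seconds(10 * GB, 100 * MBIT)
print(f"10 GB at 100 Mb/s: {seconds / 60:.1f} minutes")
```

The same function can be used to estimate how long a full run of a few hundred GB would take over whatever link you actually have.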

3
11.3 years ago

Money (or lack of it) may determine your storage strategy as well as your projected output of sequence data. Depending on your workflow it is not uncommon to use 1TB of data on a single project before cleaning up and compressing files.

Gigabit LAN (at a minimum) is almost essential, but manually transferring data via hard drives is still an option if desperate. Speak to your local network guys for advice, as you may have access to advanced filesystems (e.g. GPFS) or alternative transfer protocols.

Very high end: enterprise cluster and SAN. Our cluster charges 500 UK pounds per TB per year.

High end: two mirrored servers in separate buildings, much better than just tape backup. 10k UK pounds gets you about 48TB per server.

High end 2: cloud. Galaxy is now available on the Amazon cloud. If you are a small group this may be the most cost-effective solution, so it is worth exploring.

Very low end: e-SATA to an external hard disk reader from your workstation. 2TB disks are cheap and have a long lifespan when turned off and stored correctly.
Even though I have mirrored servers, I still use this sometimes for off-line archiving for some projects.
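
To compare the first two tiers on the prices quoted above, a rough Python sketch follows. The default values come from this post; everything else (function names, ignoring power, administration, and hardware replacement) is a simplifying assumption, so treat the output as a starting point only.

```python
def enterprise_cost(terabytes, years, price_per_tb_year=500):
    """Recurring charge in UK pounds for cluster/SAN storage."""
    return terabytes * years * price_per_tb_year


def mirrored_cost_per_tb(server_price=10_000, server_capacity_tb=48, servers=2):
    """Up-front hardware price in UK pounds per usable TB when the same
    data is stored on every server. Running costs are NOT included."""
    return server_price * servers / server_capacity_tb


print(f"48 TB on the enterprise cluster, 1 year: {enterprise_cost(48, 1)} pounds")
print(f"Mirrored pair, hardware only: {mirrored_cost_per_tb():.2f} pounds per TB")
```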

1

You can also set up a private cloud computing infrastructure, with software like eucalyptus.

2
11.3 years ago
Samuel Lampa ★ 1.3k

If the LAN/Internet connection remains a problematic bottleneck, you might want to check out options that use the UDP protocol for transfer. It can be made much faster than TCP since it is much less "verbose": it does not wait for an acknowledgement of each data packet, so the transfer tool has to handle lost packets itself.

There is commercial software for this: Aspera Connect. Unfortunately I think it is rather pricey. I have also seen an open source project, UDT, which might be worth looking into.

Regarding the HPC (cluster) readiness of NGS software: AFAIK, it varies quite a lot, but generally software quality and efficiency in this area still seem to be far behind those in areas like physics. Hopefully this will improve as the number of NGS developers and standard libraries increases...

0

In the comments on this post, Brad Chapman mentions two other fast file transfer tools:

1. FDT
2. Tsunami

... both are open source(!); no. 2 is UDP-based, like Aspera, and no. 1 is TCP-based.

0

Or you could just create UDP torrents.

2
11.3 years ago
Thaman ★ 3.3k

You didn't mention which organisms and technology (Roche, Illumina, etc.) you are working with.

In any case, I suggest you look at the CLC Genomics Machine and the discussion at Seqanswers.

Most importantly: Managing Data from Next-Gen Sequencing

2
8.5 years ago
Samuel Lampa ★ 1.3k

I'm not sure what your budget and the size of your lab/group are, so depending on those factors this might not be exactly what you are after, but you might find some valuable info in a paper that we wrote on our experiences from implementing an HPC storage and compute resource for NGS sequence analysis at the UPPMAX HPC center in Sweden. The paper is open access and available at the following link:

To summarize the relevant part of the paper about transferring data: we are generally doing fine with a 10 GbE (10 Gigabit Ethernet) connection for uploading data from the sequencing machine (or a smaller local storage, depending on platform) to our cluster via rsync over SSH. In fact, all 3+ platforms using the cluster use the same "entry" machine to upload the data.
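
For readers who want to script this kind of upload, here is a minimal Python sketch that builds an rsync-over-SSH command. It is not the setup described in the paper: the host name and paths in the usage note are hypothetical placeholders, and the option choice (`-a` to preserve file attributes, `--partial` to keep partially transferred files so an interrupted upload can resume, `--progress` for feedback) is just one reasonable combination.

```python
import subprocess


def build_rsync_command(source_dir, remote_target):
    """Return the argument list for an rsync upload over SSH."""
    return [
        "rsync",
        "-a",          # archive mode: recurse and preserve attributes
        "--partial",   # keep partial files so a rerun can resume
        "--progress",  # show per-file progress
        "-e", "ssh",   # use SSH as the transport
        source_dir,
        remote_target,
    ]


def upload(source_dir, remote_target):
    """Run the upload; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source_dir, remote_target), check=True)
```

Usage would look like `upload("run_001/", "user@cluster.example.org:/proj/incoming/run_001/")`, where both arguments are made-up examples.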

0

nice thoughts displayed, very interesting numbers shown.