Question: fastq compression tools of choice
Asked by Richard (Canada), 20 months ago (modified 18 months ago) • 2 votes:

Hi all,

I'll be trying out a few compression tools for FASTQ files. So far my list includes the following:

  1. dsrc
  2. lrzip
  3. gzip
  4. bgzip (BGZF)

Anyone have any good/poor experience with any of the above, or other options?

I'll be trying them all, plotting compression ratio against compression and decompression CPU time, but I'm interested to hear whether anyone has a reason not to consider any of the above, or whether there are other tools that should be considered.

Indexing and RAM usage are not a concern.
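Here is roughly the benchmarking harness I have in mind, as a minimal sketch in Python: reads.fastq is a stand-in for a real file, only gzip and bzip2 entries are shown, and dsrc/lrzip would be added with their own command lines. It records wall-clock time; child CPU time could be read via resource.getrusage(resource.RUSAGE_CHILDREN) instead.

    import os
    import subprocess
    import time

    FASTQ = "reads.fastq"  # placeholder input file

    # (label, compression command, path of the compressed output)
    TOOLS = [
        ("gzip -9",  f"gzip -9 -c {FASTQ} > {FASTQ}.gz",   FASTQ + ".gz"),
        ("bzip2 -9", f"bzip2 -9 -c {FASTQ} > {FASTQ}.bz2", FASTQ + ".bz2"),
    ]

    orig_size = os.path.getsize(FASTQ)
    for label, cmd, out in TOOLS:
        start = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True)
        elapsed = time.perf_counter() - start
        ratio = orig_size / os.path.getsize(out)
        print(f"{label}: {elapsed:.1f} s, {ratio:.2f}x compression")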

EDIT Oct 28, 2015: We have tested lrzip, gzip, dsrc, bzip2, and others, and found that dsrc is by far the best tool for FASTQ compression. It is the fastest to compress and achieves the highest compression ratio. Are there other folks out there using dsrc?

thanks,

Richard

compression fastq • 2.0k views
Comment by Alex Reynolds, 20 months ago • 6 votes:

What do you need out of compression? Fast compression time? Fast extraction time? Best compression efficiency? Low run-time memory usage? Do you need indexing (random access)?

Compression is a deep subject. Different algorithms have different characteristics that make them suitable for different use cases. You probably need to specify your criteria first before this becomes an answerable question.

Answer by Charles Plessy (Japan), 20 months ago • 4 votes:

You may be interested in the article published in PLOS ONE (2013;8(3):e59190) by James K. Bonfield and Matthew V. Mahoney: Compression of FASTQ and SAM Format Sequencing Data.


Comment by Eric Normandeau, 18 months ago:

The article is good, but it is not enough to guide your choice. What we need are tested tools. Gzip is trustworthy and unlikely to contain bugs that would be detrimental to your data. What about all these new tools presented in the article? Which ones are dependable?
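One concrete way to test dependability is a lossless round trip: compress, decompress, and compare checksums of the original and recovered files. A minimal sketch in Python, where reads.fastq is a placeholder and dsrc's c/d compress/decompress subcommands are an assumption; substitute whatever tool you are vetting:

    import hashlib
    import subprocess

    def md5(path):
        """Stream the file so large FASTQs need not fit in RAM."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Round trip with the tool under test (dsrc's c/d subcommands assumed).
    subprocess.run(["dsrc", "c", "reads.fastq", "reads.dsrc"], check=True)
    subprocess.run(["dsrc", "d", "reads.dsrc", "roundtrip.fastq"], check=True)

    assert md5("reads.fastq") == md5("roundtrip.fastq"), "round trip was lossy!"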

Answer by Antonio R. Franco (Spain, Universidad de Córdoba), 20 months ago • 3 votes:

I would say that the gzip format will make the compressed file compatible with more applications.

Comment by Len Trigg, 20 months ago • 3 votes:

And if you opt for block-gzip compression, you get that backwards compatibility plus the ability to use multiple cores for compression and decompression, e.g. via pbgzip. (It comes at a small cost in compression ratio compared to normal gzip, but since you can use multiple cores, you can probably recover that by increasing the gzip compression level.)
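For example, a minimal sketch in Python (reads.fastq is a placeholder; bgzip ships with htslib, and the -@ thread-count flag assumes a reasonably recent version):

    import gzip
    import subprocess

    # Compress with bgzip on 4 threads; like gzip, it replaces the input
    # file with reads.fastq.gz. The BGZF output is still a valid gzip stream.
    subprocess.run(["bgzip", "-@", "4", "reads.fastq"], check=True)

    # Any ordinary gzip reader can open the result, no BGZF support needed.
    with gzip.open("reads.fastq.gz", "rt") as handle:
        print(handle.readline().rstrip())  # first FASTQ header line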

Answer by Alex Reynolds (Seattle, WA, USA), 20 months ago • 1 vote:

Algorithms that do well with text compression are probably worth investigating, insofar as uncompressed FASTQ is structured text. This site offers a pretty comprehensive comparison of various algorithms as applied to different corpora (Wikipedia, XML, etc.).
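To get a quick feel for this, you can compare general-purpose text compressors on the same FASTQ bytes using only the Python standard library; reads.fastq is a placeholder, and the ratios will vary with read length and quality encoding:

    import bz2
    import lzma
    import zlib

    with open("reads.fastq", "rb") as f:
        data = f.read()

    # Compression ratio of three general-purpose text compressors.
    for name, blob in [("zlib (gzip)", zlib.compress(data, 9)),
                       ("bzip2", bz2.compress(data, 9)),
                       ("xz/lzma", lzma.compress(data, preset=9))]:
        print(f"{name}: {len(data) / len(blob):.2f}x")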
