Question: fastq compression tools of choice
Richard wrote (3.4 years ago):

Hi all,

I'll be trying out a few compression tools for fastq files. So far my list includes:

  1. dsrc
  2. lrzip
  3. gzip
  4. bgzf

Anyone have any good/poor experience with any of the above, or other options?

I'll be trying them all and plotting compression ratio vs. CPU time (for both compression and decompression), but I'm interested to hear if anyone has a reason not to consider any of the above, or if there are other tools worth considering.

Indexing and RAM usage are not of concern.
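For what it's worth, a benchmark along these lines can be sketched with Python's standard-library codecs (gzip, bz2, lzma) as stand-ins; external tools like dsrc and lrzip would have to be invoked via subprocess instead, and the synthetic FASTQ data below is only a placeholder for a real file:

```python
import bz2
import gzip
import lzma
import time

def benchmark(data: bytes) -> dict:
    """Compare stdlib codecs on raw FASTQ bytes: ratio and CPU time."""
    codecs = {"gzip": gzip, "bz2": bz2, "lzma": lzma}
    results = {}
    for name, mod in codecs.items():
        t0 = time.process_time()
        comp = mod.compress(data)
        t_comp = time.process_time() - t0

        t0 = time.process_time()
        decomp = mod.decompress(comp)
        t_decomp = time.process_time() - t0

        assert decomp == data  # round-trip sanity check
        results[name] = {
            "ratio": len(data) / len(comp),
            "compress_s": t_comp,
            "decompress_s": t_decomp,
        }
    return results

# Tiny synthetic FASTQ records; a real run would read an actual file.
record = b"@read1\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n"
stats = benchmark(record * 1000)
for name, r in sorted(stats.items(), key=lambda kv: -kv[1]["ratio"]):
    print(f"{name}: ratio {r['ratio']:.1f}x, "
          f"comp {r['compress_s']:.3f}s, decomp {r['decompress_s']:.3f}s")
```

Note that highly repetitive synthetic data will inflate ratios compared to real reads, so the absolute numbers only mean something when run on a genuine FASTQ file.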

EDIT Oct 28, 2015: We have tested lrzip, gzip, dsrc, bzip2, and others, and found that dsrc is by far the best tool for fastq compression: it is the fastest to compress and has the highest compression ratio. Are there other folks out there using dsrc?







Tags: compression, fastq

What do you need out of compression? Fast compression time? Fast extraction time? Best compression efficiency? Low run-time memory usage? Do you need indexing (random access)?

Compression is a deep subject. Different algorithms have different characteristics that make them suitable for different use cases. You probably need to specify your criteria, first, before this becomes an answerable question.

— Alex Reynolds (3.4 years ago)
Charles Plessy wrote (3.4 years ago):

You may be interested in the article published in PLOS ONE (2013;8(3):e59190) by James K. Bonfield and Matthew V. Mahoney: Compression of FASTQ and SAM Format Sequencing Data.


The article is good, but it is not enough to guide your choice. What we need are tested tools. Gzip is trustworthy and unlikely to contain bugs that would be detrimental to your data. What about all these new tools presented in the article? Which ones are dependable?

— Eric Normandeau (3.2 years ago)
Antonio R. Franco (Universidad de Córdoba, Spain) wrote (3.4 years ago):

I would say that the gzip format will make the compressed file compatible with more applications.


And if you opt for block-gzip compression, you keep that backwards compatibility and gain the ability to use multiple cores for compression and decompression, e.g. via pbgzip. (It comes at a small cost in compression ratio compared to normal gzip, but since you can use multiple cores, you can probably recover that by increasing the gzip compression level.)

— Len Trigg (3.4 years ago)
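The backwards compatibility mentioned above rests on a property of the gzip format: a file may consist of multiple concatenated gzip members, and BGZF is exactly such a series of independently compressed blocks. A minimal illustration using Python's stdlib gzip module (as a stand-in for an actual bgzip writer):

```python
import gzip

# Two chunks compressed independently, as a block-gzip writer would do.
chunk1 = b"@read1\nACGT\n+\nIIII\n"
chunk2 = b"@read2\nTGCA\n+\nIIII\n"
blocks = gzip.compress(chunk1) + gzip.compress(chunk2)

# Any standards-compliant gzip reader decompresses the concatenation
# as one continuous stream -- this is why plain gunzip can read BGZF.
assert gzip.decompress(blocks) == chunk1 + chunk2
print("multi-member gzip round-trip OK")
```

Because each block is independent, a multi-threaded tool can compress or decompress blocks in parallel, while single-threaded gzip readers still see one valid stream.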
Alex Reynolds (Seattle, WA, USA) wrote (3.4 years ago):

Algorithms that do well with text compression are probably worth investigating, insofar as uncompressed FASTQ is structured text. This site offers a pretty comprehensive comparison of various algorithms as applied to different corpora (Wikipedia, XML, etc.).

Powered by Biostar version 2.3.0