Question

What Is the Best Current DNA Compression Algorithm?

11

13.5 years ago

Martyix ▴ 120

I'm looking for a comparison of the DNA compression algorithms in use nowadays, and I'm unable to find one. Is there one?

Thank you!

dna compression ngs sequencing fasta • 12k views

ADD COMMENT • link updated 5.6 years ago by Kirill Kryukov ▴ 10 • written 13.5 years ago by Martyix ▴ 120

2

What do you mean by "best"?

ADD REPLY • link 13.5 years ago by Martin A Hansen 3.0k

0

Mainly time complexity and space complexity

ADD REPLY • link 13.5 years ago by Martyix ▴ 120

0

There is normally a trade-off between them, but do you also take the compression ratio into account? (Space complexity refers to how much memory the algorithm uses.)

ADD REPLY • link 13.5 years ago by Michael 56k

score 5 · Answer 1 · 2012-01-03

5

13.5 years ago

Michael 56k

There is not much special about DNA data that would warrant a specialized compression algorithm. The most space-efficient plain representation of nucleotides is a binary file that encodes each base in 2 bits or, a bit more versatile, the .2bit format, which also keeps headers and 'N' characters.
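To make the 2-bit idea concrete, here is a minimal sketch that packs four bases into each byte. The function names and the base-to-code mapping are hypothetical choices for illustration; this is not the on-disk layout of the .2bit format, and it does not handle 'N' or headers.

```python
# Hypothetical mapping for illustration; the real .2bit format defines its own.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Each base takes two bits; the first base of a byte sits in the high bits.
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )


seq = "ACGTTGCAAC"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
```

The sequence length has to be stored alongside the packed bytes, because the last byte may be only partly filled.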

The Burrows-Wheeler transform has been used in some alignment tools lately, and it is also used in, e.g., bzip2. AFAIK, bzip2 is one of the best general-purpose compression algorithms around (in the sense of compression ratio). Like any other lossless compressor, it can also be used for nucleotide data in FASTA and .2bit format.
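A rough way to check this on your own data is to run a few general-purpose compressors from the Python standard library and report bits per base. The sketch below uses uniformly random bases as a stand-in input, which is an assumption and close to a worst case; real genomes contain repeats, so the numbers will differ.

```python
import bz2
import lzma
import random
import zlib

random.seed(0)  # fixed seed so the synthetic input is reproducible
# Synthetic stand-in for a FASTA sequence body: one ASCII byte per base.
seq = "".join(random.choice("ACGT") for _ in range(200_000)).encode("ascii")

compressors = {
    "zlib (DEFLATE, as in gzip)": lambda d: zlib.compress(d, 9),
    "bz2 (BWT-based, as in bzip2)": lambda d: bz2.compress(d, 9),
    "lzma (as in xz)": lambda d: lzma.compress(d),
}

sizes = {}
for name, compress in compressors.items():
    sizes[name] = len(compress(seq))
    print(f"{name}: {sizes[name] * 8 / len(seq):.2f} bits per base")

# Reference point: plain 2-bit packing needs exactly 2 bits per base.
print("plain 2-bit packing: 2.00 bits per base")
```

Swapping `seq` for the bytes of a real FASTA file gives a quick comparison for that file.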

For obvious reasons there is pretty limited use for lossy compression.

ADD COMMENT • link 13.5 years ago by Michael 56k

1

If you need to compress a genome or a bunch of sequences in a FASTA file, I agree, but if you want to compress sequences from a high-throughput project, a specialized algorithm can do much better than bzip2. Read http://genome.cshlp.org/content/early/2011/01/18/gr.114819.110 to see why. Also, they argue there IS space for lossy compression (of the quality scores).
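One common form of lossy quality-score compression is binning: many distinct Phred values are mapped onto a few representative ones, which then compress better. The sketch below only illustrates the idea. The bin edges, the representative values, and the random test input are hypothetical, and this is not the method of the linked paper.

```python
import bisect
import random
import zlib

# Hypothetical bins, chosen for illustration only.
BIN_EDGES = [10, 20, 30]      # Phred thresholds separating the four bins
BIN_VALUES = [5, 15, 25, 35]  # value stored for every score in each bin


def bin_qualities(qual: str, offset: int = 33) -> str:
    """Lossily map a Phred+33 quality string onto four representative values."""
    binned = []
    for ch in qual:
        q = ord(ch) - offset
        binned.append(chr(BIN_VALUES[bisect.bisect_right(BIN_EDGES, q)] + offset))
    return "".join(binned)


random.seed(0)
# Synthetic quality string with Phred scores drawn uniformly from 0..40.
qual = "".join(chr(random.randint(0, 40) + 33) for _ in range(10_000))
binned = bin_qualities(qual)

print("original:", len(zlib.compress(qual.encode("ascii"), 9)), "bytes")
print("binned:  ", len(zlib.compress(binned.encode("ascii"), 9)), "bytes")
```

The original scores cannot be recovered after binning, so whether this is acceptable depends on the downstream analysis.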

ADD REPLY • link 13.5 years ago by Stefano Berri 4.4k

1

Yes, but that wasn't the question. The OP mentioned only 'DNA compression', which I suggest is already odd, because you don't compress DNA (the molecule). Sometimes it is worthwhile to try to guess what is really meant, but I won't overdo this. If he/she wanted to learn about compression for NGS reads, he/she could say so.

ADD REPLY • link 13.5 years ago by Michael 56k

1

Bzip2 is old, and besides, bzip (aka bzip1) was better but was replaced because of patent problems (the patents have since expired). The modern alternative to bzip2 is bsc (https://github.com/IlyaGrebnov/libbsc), and it totally owns bzip2. Very impressive work.

Also for sure you do get better compression with dedicated tools, of which there are many. It depends on whether you want aligned or unaligned data, whether you want to include quality values and read identifiers or just sequences, whether you are talking about compression of lots of sequence fragments or of collections of whole genomes.

For modern FASTQ compressors try things like MiniCom, Spring, or PgRC. For aligned sequence data CRAM is the norm and by increasing the slice size and enabling things like lzma you can get better performance. (I also have a branch that uses bsc, just to see, and it did indeed do very well.)

ADD REPLY • link 5.9 years ago by jkbonfield ★ 1.3k

0

Actually, lossy compression of DNA sequence has intrigued me for a while. DNA sequencing already includes noise of a technical and biological nature that must be accommodated. Adding some noise from lossy compression may be tolerable under some conditions, e.g. in sequencing with massive coverage.

ADD REPLY • link 13.5 years ago by Martin A Hansen 3.0k

score 2 · Answer 2 · 2012-01-03

2

13.5 years ago

Leszek 4.2k

In case you want to compress multiple genomes of one organism, you should have a look at Genome Differential Compressor and its paper.
Using GDC, 'a whole human genome can be stored in less than 3.12 MB.' Of course, one has to store the reference as well.

ADD COMMENT • link 13.5 years ago by Leszek 4.2k

0

In my opinion, this is a very interesting approach for this specialized use case. Similar approaches to differential compression have existed before, so it is not really new. rsync, for example, uses delta encoding (http://en.wikipedia.org/wiki/Delta_encoding).
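The core idea of reference-based (differential) storage can be sketched in a few lines: keep the reference once, and for each additional genome store only the positions where it differs. This is a toy illustration under a strong assumption (equal lengths, substitutions only). The function names are hypothetical, and it is neither GDC's algorithm nor rsync's.

```python
def diff_against_reference(reference: str, sample: str) -> list:
    """Return (position, base) pairs where the sample differs from the reference.

    Assumes both sequences have the same length, i.e. substitutions only.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diff: list) -> str:
    """Rebuild the sample from the reference plus the stored differences."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "ACGTACGTACGTACGT"
sample = "ACGTACCTACGTACGA"

diff = diff_against_reference(reference, sample)
print(diff)  # [(6, 'C'), (15, 'A')]
```

A real tool also has to represent insertions, deletions, and larger rearrangements, and it compresses the difference list itself.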

ADD REPLY • link 13.5 years ago by Michael 56k

Ram · Answer 3 · 2012-01-03

1

13.5 years ago

Stefano Berri 4.4k

Sequences are usually kept in BAM format, which is compressed and indexed.

Lately, a compression format specifically designed for NGS reads, CRAM, has been developed, but it is still under development and experimental.

The topic was already partially discussed here

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.5 years ago by Stefano Berri 4.4k

0

Another discussion here.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 13.5 years ago by Ying W ★ 4.3k

score 1 · Answer 4 · 2019-11-28

You can check Sequence Compression Benchmark. It includes both specialized and general-purpose compressors.

The answer naturally depends on your definition of "best" and on what data you need to compress. Typically specialized sequence compressors provide the best compactness, but general purpose compressors are often faster. (However, both of these tendencies have exceptions).

Ram · Answer 5 · 2012-09-06

0

12.8 years ago

monzoorul • 0

Check out the latest genome compression algorithm published in Bioinformatics, 2012

http://www.ncbi.nlm.nih.gov/pubmed/22833526

DELIMINATE - A fast and efficient method for loss-less compression of genomic sequences. Bioinformatics. 2012

DELIMINATE is useful for compressing real-world data corresponding to (fasta formatted) genome fna or ffn files.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 12.8 years ago by monzoorul • 0

score 0 · Answer 6 · 2016-06-17

How much compression you achieve will depend not only on the archive type you create, but also on the application you use to compress it and the settings you use. If you want to compress something to use as little space as possible, you should definitely use 7z. You can even crank up the compression settings to save even more space, although it will take longer to compress and decompress. Most compression tools have settings that let you achieve a higher compression ratio at the cost of slower compression/decompression and more RAM usage.
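The effect of the compression setting is easy to measure. The sketch below uses Python's `lzma` module, which implements the same LZMA family of compression that 7z uses by default, although it writes .xz containers rather than .7z archives. The input is synthetic (a random 5 kb unit repeated 100 times), which is an assumption; timings and sizes on real data will differ.

```python
import lzma
import random
import time

random.seed(1)
# Synthetic, highly repetitive input: a random 5 kb unit repeated 100 times.
unit = "".join(random.choice("ACGT") for _ in range(5_000))
data = (unit * 100).encode("ascii")

results = {}
for preset in (0, 6, 9):  # 0 = fastest, 9 = strongest, 6 = library default
    start = time.perf_counter()
    packed = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    results[preset] = len(packed)
    print(f"preset {preset}: {len(packed)} bytes in {elapsed:.3f} s")
```

Higher presets also use larger dictionaries, which is where the extra RAM usage comes from.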

Manly

score 0 · Answer 7 · 2019-08-13

GTZ is a more recent compression tool whose authors reported significantly better FASTQ compression performance than other methods: pigz, LFQC, Fqzcomp, LW-FQZip, QUIP, and DSRC2. I wonder how it compares to the other methods mentioned here, like 7z and DELIMINATE.

Has anyone used GTZ and can speak to its performance?

REF - https://www.ncbi.nlm.nih.gov/pubmed/29297296