What Is the Best Current DNA Compression Algorithm?
7
6
12.3 years ago
Martyix ▴ 110

I'm looking for a comparison of DNA compression algorithms used nowadays and I'm unable to find one. Is there any?

Thank you!

dna compression ngs sequencing fasta • 10k views
2

What do you mean by "best"?

0

Mainly time complexity and space complexity

0

There is normally a tradeoff between them, but do you also take the compression ratio into account? (Space complexity means how much memory the algorithm uses, not how small the output is.)

5
12.3 years ago
Michael 54k

There is not much special about DNA data that would warrant a specialized compression algorithm. The most space-efficient plain representation of nucleotides is a binary file in which each base is encoded in 2 bits, or, somewhat more versatile, the .2bit format, which keeps headers and also 'N' characters.
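A minimal sketch of the 2-bit idea, assuming the sequence contains only A, C, G and T. The function names and the base-to-code mapping are made up for illustration; this is not the UCSC .2bit file format, which additionally stores headers and the positions of N blocks.

```python
# Toy 2-bit packing: four bases per byte.
# The mapping below is an arbitrary choice for this example.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT-only string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so decoding is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; the original length must be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])

seq = "ACGTACGTAC"
packed = pack_2bit(seq)
restored = unpack_2bit(packed, len(seq))
```

Note that the sequence length has to be kept alongside the packed bytes, because the last byte may be padded.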

The Burrows-Wheeler transform has lately been used in some alignment tools, and it is also used in, e.g., bzip2. As far as I know, bzip2 is one of the best general-purpose compressors around in terms of compression ratio. Like any other lossless compressor, it can also be used for nucleotide data in FASTA and .2bit format.
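To compare general-purpose compressors on nucleotide text yourself, a rough sketch using only the Python standard library could look like this. The input here is a pseudo-random ACGT string, which is an assumption for the sake of a self-contained example: real genomes contain repeats and structure, so the numbers will differ on real data.

```python
import bz2
import lzma
import random
import zlib

def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of data under several stdlib compressors."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }

random.seed(0)
# Synthetic stand-in for a FASTA sequence body (no header, no line breaks).
sequence = "".join(random.choice("ACGT") for _ in range(100_000)).encode("ascii")

sizes = compressed_sizes(sequence)
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```

On random input no lossless compressor can go below about 2 bits per base, so this mostly shows how close each tool gets to plain 2-bit packing; swap in a real sequence to see the effect of repeats.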

For obvious reasons there is pretty limited use for lossy compression.

1

If you need to compress a genome or a bunch of sequences in a FASTA file, I agree, but if you want to compress sequences from a high-throughput project, a specialized algorithm can do much better than bzip2. Read http://genome.cshlp.org/content/early/2011/01/18/gr.114819.110 to see why. Also, they argue there IS room for lossy compression (of the quality scores).
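As a rough illustration of why lossy handling of quality scores can pay off, here is a sketch that coarsens Phred scores into bins before compressing. The bin width of 8 and the function name are assumptions made for this example, not the scheme from the linked paper, and the quality string is synthetic.

```python
import bz2
import random

PHRED_OFFSET = 33  # Phred+33 encoding (Sanger / Illumina 1.8+ FASTQ)

def bin_qualities(qual: str, width: int = 8) -> str:
    """Replace each Phred score by the lower bound of its bin (lossy)."""
    out = []
    for ch in qual:
        q = ord(ch) - PHRED_OFFSET
        out.append(chr((q // width) * width + PHRED_OFFSET))
    return "".join(out)

random.seed(1)
# Synthetic quality string with scores drawn uniformly from 0..40.
qual = "".join(chr(random.randint(0, 40) + PHRED_OFFSET) for _ in range(50_000))
binned = bin_qualities(qual)

size_exact = len(bz2.compress(qual.encode("ascii")))
size_binned = len(bz2.compress(binned.encode("ascii")))
print(f"exact: {size_exact} bytes, binned: {size_binned} bytes")
```

Fewer distinct symbols means less entropy for the compressor to encode, at the cost of not being able to recover the exact scores.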

1

Yes, but that wasn't the question. The OP mentioned only 'DNA compression', which I think is already odd, because you don't compress DNA (the molecule). Sometimes it is worthwhile to try to guess what is really meant, but I won't overdo this. If he/she wanted to learn about compression of NGS reads, he/she could have said so.

1

Bzip2 is old; besides, bzip (aka bzip1) was better but was replaced because of patent problems (the patents have since expired). The modern alternative to bzip2 is bsc (https://github.com/IlyaGrebnov/libbsc), and it totally owns bzip2. Very impressive work.

Also for sure you do get better compression with dedicated tools, of which there are many. It depends on whether you want aligned or unaligned data, whether you want to include quality values and read identifiers or just sequences, whether you are talking about compression of lots of sequence fragments or of collections of whole genomes.

For modern FASTQ compressors try things like MiniCom, Spring, or PgRC. For aligned sequence data CRAM is the norm and by increasing the slice size and enabling things like lzma you can get better performance. (I also have a branch that uses bsc, just to see, and it did indeed do very well.)

0

Actually, lossy compression of DNA sequence has intrigued me for a while. DNA sequencing already includes noise of a technical and biological nature that must be accommodated. Adding some noise from lossy compression may be tolerable under some conditions, e.g. in sequencing with massive coverage.

2
12.3 years ago
Leszek 4.2k

In case you want to compress multiple genomes of one organism, you should have a look at Genome Differential Compressor (GDC) and its paper.
Using GDC, 'a whole human genome can be stored in less than 3.12 MB.' Of course, one has to store the reference as well.
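The general idea behind reference-based (differential) compression can be sketched in a few lines. This is a toy under strong assumptions (sequences of equal length, substitutions only, no indels) with made-up function names; it is not GDC's actual algorithm.

```python
def diff_against_reference(reference: str, sample: str) -> list:
    """List (position, base) pairs where sample differs from reference.

    Toy model: assumes equal lengths, i.e. substitutions only.
    """
    if len(reference) != len(sample):
        raise ValueError("toy example handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]

def apply_diff(reference: str, diff: list) -> str:
    """Rebuild the sample from the reference plus the stored differences."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTAC"
sample = "ACGAACGTCC"
diff = diff_against_reference(reference, sample)
rebuilt = apply_diff(reference, diff)
```

Because two genomes of the same species differ at only a small fraction of positions, the list of differences is far smaller than the genome itself, which is why the reference must be stored once and shared.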

0

In my opinion, this is a very interesting approach for this specialized case, but similar approaches to differential compression have existed before, so it is not really new. rsync, for example, uses delta encoding (http://en.wikipedia.org/wiki/Delta_encoding).

1
12.3 years ago

Sequences are usually kept in BAM format, which is compressed and indexed.

Recently, a compression format specifically designed for NGS reads, CRAM, has been developed, but it is still under development and experimental.

The topic was already partially discussed here.

0

Another discussion here.

1
4.4 years ago

You can check Sequence Compression Benchmark. It includes both specialized and general-purpose compressors.

The answer naturally depends on your definition of "best" and on what data you need to compress. Typically, specialized sequence compressors produce the smallest output, but general-purpose compressors are often faster. (Both of these tendencies have exceptions, however.)

0
11.6 years ago
monzoorul • 0

Check out the latest genome compression algorithm published in Bioinformatics, 2012

http://www.ncbi.nlm.nih.gov/pubmed/22833526

DELIMINATE - A fast and efficient method for loss-less compression of genomic sequences. Bioinformatics. 2012

DELIMINATE is useful for compressing real-world data such as FASTA-formatted genome files (.fna or .ffn).

0
7.8 years ago

How much compression you achieve will depend not only on the archive type you create, but also on the application you use to compress it and the settings you use. If you want to compress something to use as little space as possible, you should definitely use 7z. You can even crank up the compression settings to save more space, although it will take longer to compress and decompress. Most compression tools have settings that let you achieve a higher compression ratio at the cost of slower compression/decompression and more RAM usage.
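The size-versus-time tradeoff is easy to measure. The sketch below uses Python's standard `lzma` module (the LZMA family is what 7z uses by default) with presets 0, 6 and 9; the helper name and the synthetic, highly repetitive input are assumptions for the example, so timings and sizes on real files will look different.

```python
import lzma
import random
import time

def try_presets(data: bytes, presets=(0, 6, 9)) -> dict:
    """Compress data at several LZMA presets; return {preset: (size, seconds)}."""
    results = {}
    for preset in presets:
        start = time.perf_counter()
        compressed = lzma.compress(data, preset=preset)
        elapsed = time.perf_counter() - start
        # Sanity check: compression must be lossless.
        assert lzma.decompress(compressed) == data
        results[preset] = (len(compressed), elapsed)
    return results

random.seed(2)
unit = "".join(random.choice("ACGT") for _ in range(2_000))
data = (unit * 100).encode("ascii")  # 200 kB of repeated sequence

results = try_presets(data)
for preset, (size, seconds) in results.items():
    print(f"preset {preset}: {size} bytes in {seconds:.3f} s")
```

Higher presets use larger dictionaries and more exhaustive match searching, which is where the extra time and memory go.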

Manly

0
4.7 years ago
DavidStreid ▴ 90

GTZ is a more recent compression tool whose authors report significantly better performance for FASTQ compression than other methods (pigz, LFQC, Fqzcomp, LW-FQZip, QUIP, and DSRC2). I wonder how it compares to the other methods mentioned here, like 7z and DELIMINATE.

Has anyone used GTZ and can speak to its performance?

REF - https://www.ncbi.nlm.nih.gov/pubmed/29297296
