I'm looking for a comparison of DNA compression algorithms used nowadays and I'm unable to find one. Is there any?
Thank you!
There is not much special about DNA data that would warrant a specialized compression algorithm. The most space-efficient representation of nucleotides is a binary file with each base encoded in 2 bits or, a bit more versatile, the .2bit format, which also keeps headers and 'N' characters.
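To make the 2-bit idea concrete, here is a minimal sketch of plain 2-bit packing in Python. It is not the .2bit file format (no headers, no 'N' handling). The function names `pack_2bit` and `unpack_2bit` and the A=0, C=1, G=2, T=3 mapping are assumptions chosen for illustration, and the sequence length has to be stored separately so the padding in the last byte can be dropped.

```python
# Illustrative mapping only; any fixed assignment of the four bases to 0..3 works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an upper-case A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            # Raises KeyError for anything other than A, C, G, T (e.g. 'N').
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte unpacks the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from 2-bit packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ACGTTGCAAC"
packed = pack_2bit(seq)
assert unpack_2bit(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")
```

Ten bases fit in three bytes here, i.e. a quarter of the one-byte-per-base text size, rounded up.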
The Burrows-Wheeler transform has been used in some alignment tools lately, and is also used e.g. in bzip2. AFAIK, bzip2 is one of the best (in the sense of compression ratio) general-purpose compression algorithms around. Like any other lossless compression, it can also be used for nucleotide data in FASTA and .2bit format.
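As a rough illustration of running bzip2 over FASTA text, the sketch below uses Python's standard `bz2` module on a synthetic record. The helper `make_fasta`, the record name, and the random sequence are all assumptions made for the example. A uniformly random sequence has none of the repeats found in real genomes, so the sizes it prints say little about real data.

```python
import bz2
import random


def make_fasta(name: str, seq: str, width: int = 60) -> bytes:
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return (">" + name + "\n" + "\n".join(lines) + "\n").encode("ascii")


# Synthetic input: 100,000 uniformly random bases (an assumption, not real data).
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(100_000))
fasta = make_fasta("synthetic_example", seq)

compressed = bz2.compress(fasta, compresslevel=9)
# Lossless: decompressing gives back the exact FASTA bytes.
assert bz2.decompress(compressed) == fasta

print("FASTA text bytes:  ", len(fasta))
print("bzip2 bytes:       ", len(compressed))
print("2-bit packed bytes:", (len(seq) + 3) // 4)
```

Comparing the last two numbers on your own files is a quick way to see how close a general-purpose compressor gets to the plain 2-bit size.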
For obvious reasons there is pretty limited use for lossy compression.
If you need to compress a genome or a bunch of sequences in a FASTA file, I agree, but if you want to compress sequences from a high-throughput project, a specialized algorithm can do much better than bzip2. Read http://genome.cshlp.org/content/early/2011/01/18/gr.114819.110 to see why. Also, they argue there IS space for lossy compression (of the quality scores).
Yes, but that wasn't the question. The OP mentioned only 'DNA-compression', which I suggest is already odd, because you don't compress DNA (the molecule). Sometimes it is worthwhile to try to guess what is really meant, but I won't overdo this. If he/she wanted to learn about compression for NGS reads, he/she could say so.
Bzip2 is old, and besides, bzip (aka bzip1) was better but was replaced because of now-expired patent problems. The modern alternative to bzip2 is bsc (https://github.com/IlyaGrebnov/libbsc), and it totally owns bzip2. Very impressive work.
Also for sure you do get better compression with dedicated tools, of which there are many. It depends on whether you want aligned or unaligned data, whether you want to include quality values and read identifiers or just sequences, whether you are talking about compression of lots of sequence fragments or of collections of whole genomes.
For modern FASTQ compressors try things like MiniCom, Spring, or PgRC. For aligned sequence data CRAM is the norm and by increasing the slice size and enabling things like lzma you can get better performance. (I also have a branch that uses bsc, just to see, and it did indeed do very well.)
Actually, lossy compression of DNA sequence has intrigued me for a while. DNA sequencing already includes noise of a technical and biological character that must be accommodated. Adding some noise from lossy compression may be tolerated under some conditions, e.g. in sequencing with massive coverage.
In case you want to compress multiple genomes of one organism, you should have a look at Genome Differential Compressor and its paper.
Using GDC 'a whole human genome can be stored in less than 3.12 MB.' Of course, one has to store reference as well.
In my opinion, this is a very interesting approach for this specialized case, but there have existed similar approaches to differential compression, so it is not really new. rsync, for example, uses delta encoding (http://en.wikipedia.org/wiki/Delta_encoding).
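The general idea of storing a genome as differences from a reference can be sketched in a few lines. This is a toy illustration only and is not GDC's algorithm: it assumes both sequences have the same length and records substitutions only (no insertions or deletions), and the function names `diff_against_reference` and `apply_diff` are hypothetical.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this toy sketch only handles equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]


def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus the stored differences."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

diff = diff_against_reference(reference, sample)
print(diff)  # [(4, 'T'), (9, 'A')]
assert apply_diff(reference, diff) == sample
```

Only the short list of differences needs to be stored per genome, which is why the reference has to be kept alongside it.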
Sequences are usually kept in BAM format, which is compressed and indexed.
Lately, a compression format specifically designed for NGS reads, CRAM, has been developed, but it is still under development and experimental.
The topic was already partially discussed here
You can check Sequence Compression Benchmark. It includes both specialized and general-purpose compressors.
The answer naturally depends on your definition of "best" and on what data you need to compress. Typically specialized sequence compressors provide the best compactness, but general purpose compressors are often faster. (However, both of these tendencies have exceptions).
Check out the latest genome compression algorithm published in Bioinformatics, 2012
http://www.ncbi.nlm.nih.gov/pubmed/22833526
DELIMINATE - A fast and efficient method for loss-less compression of genomic sequences. Bioinformatics. 2012
DELIMINATE is useful for compressing real-world data corresponding to (fasta formatted) genome fna or ffn files.
How much compression you achieve will depend not only on the archive type you create, but also on the application you use to compress it and the settings you use. If you want to compress something to use as little space as possible, you should definitely use 7z. You can even crank up the compression settings to save even more space, although it will take longer to compress and decompress. Most compression tools have settings that let you achieve a higher compression ratio at the cost of slower compression/decompression and more RAM usage.
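The settings trade-off can be tried out with Python's standard `lzma` module, which implements the same LZMA family of algorithms that 7z uses by default but writes .xz streams rather than .7z archives. This is a small sketch under that assumption. The helper name `compare_presets` and the synthetic input are made up for the example, and timings will vary by machine and data.

```python
import lzma
import random
import time


def compare_presets(data: bytes, presets=(0, 6, 9)):
    """Compress `data` at several presets; return (preset, size, seconds) tuples."""
    results = []
    for preset in presets:
        start = time.perf_counter()
        compressed = lzma.compress(data, preset=preset)
        elapsed = time.perf_counter() - start
        results.append((preset, len(compressed), elapsed))
    return results


# Synthetic input: a random 5,000-base block repeated 40 times (an assumption).
rng = random.Random(0)
block = "".join(rng.choice("ACGT") for _ in range(5_000))
data = (block * 40).encode("ascii")

for preset, size, seconds in compare_presets(data):
    print(f"preset {preset}: {size} bytes in {seconds:.3f} s")
```

Higher presets generally search harder and use more memory, so it is worth measuring both size and time on a sample of your own data before settling on a setting.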
GTZ is a more recent compression technique whose authors report significantly better performance for FASTQ compression than other methods: pigz, LFQC, Fqzcomp, LW-FQZip, QUIP, and DSRC2. I'm wondering how it compares to the other methods mentioned here, like 7z and DELIMINATE.
Has anyone used GTZ and can speak to its performance?
What do you mean by "best"?
Mainly time complexity and space complexity
There is normally a tradeoff between them, but do you also take into account the compression ratio? (Space complexity means how much memory the algorithm uses.)