Which Dna Compression Algorithms Are Actually Used?
8
23
Entering edit mode
11.3 years ago

I've seen a few papers proposing algorithms to compress DNA, which as I understand is quite a challenge even for seasoned researchers in the area. Although some of those have been implemented (DNAcompress is one example I know of), I have no idea which of these algorithms/implementations are actually used by biologists or databases.

So, are any DNA compression method proposed in the computer science literature actually used in practice? And if so, what are the most popular ones?

dna • 18k views
1
Entering edit mode

Interestingly, there is a paper from 1999 with the title "Protein is incompressible". http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.1412&rep=rep1&type=pdf And http://www.dcc.uchile.cl/~gnavarro/ps/bibe10.pdf states directly that "In general, DNA and proteins are not compressible [...] using classical methods".

1
Entering edit mode

I experimented with various compression algorithms on Fastq data, and did a writeup at http://blog.malde.org/index.php/2011/06/22/compressing-biological-sequences/

0
Entering edit mode

Since this just got bumped, I'll note that there was just a review paper published on this topic in AMB which covers this question from a practitioners point of view, and the pros and cons of various approaches.

18
Entering edit mode
11.3 years ago
brentp 24k

Ewan Birney from EMBL has a recent series on this:

In the paper from their lab, they showed some very good impression using reference-based techniques. All worth a read.

Here is their implementation of the reference-based compression.

0
Entering edit mode

Very interesting indeed.

0
Entering edit mode

Thansk for the link to Birney's blog, I didn't know that one. Although I'm skeptical, especially to the premise that saving space is particularly important.

15
Entering edit mode
11.3 years ago
Gww ★ 2.7k

The most common form of DNA compression is very simple. The basic idea is to just binary encode the DNA sequence. For example, you could use 2 bit for sequences that only contain [ATGC] or 4 bit for sequences that contain other alternative bases such as [ATGCYNR]. Just by binary encoding the sequence you can cut the file size down 75% with 2 bit encoding, and 50% with 4 bit encoding.

The advantage of this method is that the file can be easily parsed again without needing complicated compression algorithms. Using a binary encoded DNA sequence reduces the memory foot print of a large DNA sequence such as human's as well.

As a side note binary encoding DNA sequences is quite common. Most hash based indexing methods hash DNA sequences using their 4 bit / 2 bit encoding to save space, furthermore, they tend to store the reference DNA sequence as a binary encoded string. For example, NovoAlign does this.

Burrows-Wheeler Transform based aligners store the BWT DNA sequence as a 2 bit binary encoded sequence.

EDIT:

You could take this further by block compressing the binary encoded blocks using zlib but this will have a speed trade off if you are parsing the files frequently. Samtools uses a block compressiong scheme with zlib to compress binary encoded SAM files.

0
Entering edit mode

Yes, this is quite straightforward, and so are RLE extensions of that -- but is this actually used?

0
Entering edit mode

@Anthony Labarre: I added an extra sentence giving a few examples

0
Entering edit mode

OK, thanks a lot!

0
Entering edit mode

Sorry, but binary encoding of the DNA sequence using 2 bits per letter is NOT compression!!!!!

In order to have compression for a DNA sequence one should show that it uses strictly less than 2 bits per letter!!!

The definition of compression is "In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation" (from: http://en.wikipedia.org/wiki/Data_compression).

So in case of DNA sequence one has an alphabet of 4 letters which requires a 2 bits per letter when no compression is done. When one stores a DNA sequence in a plain text file then indeed the DNA sequence is stored using 8-bits (or more) per letter but this is plainly inefficient way of storing it (it is like storing a binary string in a plain text file)! the process of converting a DNA sequence which was stored in a plain text file (where 8-bit per character) to 2-bit per character is NOT compression because the original representation was 2-bit to begin with!

1
Entering edit mode

that is just 'semantics'. 2bit DNA is 'compressed' compared to a fasta file.

1
Entering edit mode

i agree with Pierre...

1
Entering edit mode

@kredens

i agree with Pierre...

This is not a popularity contest!

0
Entering edit mode

I also agree with Pierre. What he's saying is correct.

0
Entering edit mode

8 years later !!!! thank you ! :-D

1
Entering edit mode

that is just 'semantics'. 2bit DNA is 'compressed' compared to a fasta file.

Sorry, by this is totally wrong! I suggest very gently to read some books and literature about compression!

FASTA file is not a valid reference for computing the compression ratio!!!!

All DNA compression algorithms have to use as reference the 2 bits per nucleotide! If they are able to compress the DNA sequence using less than 2 bits then we are having real compression anything else above 2 bits (like FASTA format) is just semantic and has nothing to do with compression!

See here:

http://www.cs.tut.fi/~tabus/genml/

http://dl.acm.org/citation.cfm?id=1055711

0
Entering edit mode

Indeed, the only information that can't be compressed is totally random information which contains no predictable patterns at all. DNA contains many predictable patterns and repetitions. Even WinRAR should be able to compress DNA fairly well compared to task specific programs. winrar takes words like "this" "the" that repeat many times and marks them with a number. DNA also contains many times the same entry so WinRAR should be able to compress it very well.

2
Entering edit mode
11.3 years ago

Just for fun: the following C code will compact a sequence by counting the number of repeated symbols. The result is an ascii string :-)

#include<stdio.h>
#include<ctype.h>

int main(int argc,char** argv)
{
char buffer[30];
int count=0;
int prev=-1;
for(;;)
{
int c=fgetc(stdin);
if(isspace(c)) continue;
c=toupper(c);
if(c==EOF || prev!=c)
{
if(prev!=-1)
{
int printed=snprintf(buffer,30,"%d%c",count,prev);
if(printed<count)
{
fputs(buffer,stdout);
}
else
while(count>0)
{
fputc(prev,stdout);
--count;
}
}
if(c==EOF) break;
count=0;
prev=c;
}
++count;
}
fputc('\n',stdout);
return 0;
}


compile & execute:

>gcc code.c
>echo "AaAAAATTttTTGCCCC   CCATGCGGAAA" | ./a.out
6A6TG6CATGCGG3A

3
Entering edit mode

This is run-length encoding (RLE), and is very useful, especially for 454 data :-). Here's the obligatory Haskell implementation:

Prelude List Control.Arrow> map (head &&& length) (group "TTTTAAAAGGCCAAATT")

gives: [('T',4),('A',4),('G',2),('C',2),('A',3),('T',2)]

0
Entering edit mode

I.e. run length encoding. This would probably be great for 454 data :-)

1
Entering edit mode
11.3 years ago
Benm ▴ 710

Refer to Pierre's methods, I think there is another(or sometimes optimized) solution, try this base2hex.pl:

#!/usr/bin/perl
use strict;
use warnings;

my %base_hex=(
"AA"=>0,
"AC"=>1,
"AG"=>2,
"AT"=>3,
"CA"=>4,
"CC"=>5,
"CG"=>6,
"CT"=>7,
"GA"=>8,
"GC"=>9,
"GG"=>'a',
"GT"=>'b',
"TA"=>'c',
"TC"=>'d',
"TG"=>'e',
"TT"=>'f',
);
my @hex_base=sort(keys %base_hex);

my $hex=uc($ARGV[0]);
$hex=~s/(\w{2})/$base_hex{$1}/eg; print "encode:$hex\n";
my $seq=$hex;
$seq=~s/([^ACGT])/$hex_base[oct("0x".$1)]/eg; print "decode:$seq\n";


test:

>perl base2hex.pl AaAAAATTttTTGCCCCCCATGCGGAAA
encode: 000fff9554e680
decode: AAAAAATTTTTTGCCCCCCATGCGGAAA

1
Entering edit mode
11.3 years ago

Markus' work (http://genome.cshlp.org/content/21/5/734.full) is now being iteratively re-implemented into production-ready software at the EBI ENA:

Feature-wise the current version is equivalent to Markus' proof-of-concept code, but about an order of magnitude faster. Future versions will include better treatment of paired-end reads and the support of a variety of sequencing platforms, starting by SOLID but also others. Right now it's still very much in beta, so it's not officially ready for production.

0
Entering edit mode
0
Entering edit mode
11.3 years ago
Ketil 4.1k

My guess for the most commonly used compression would be gzip or bzip2, applied to fasta files.

Another common compression scheme is Burroughs-Wheeler compressed index as used for indexing for [?]most of the[?] maybe some short read aligners these days. Although the primary idea is indexing for fast searching, the compression means that the index is smaller than the original data. This is actually a variant of the suffix array, pioneered, I think, by Manzini and Ferragina.

And, as pointed out, another common scheme is binary encoding of nucleotides. This has its own format (TwoBit), and is used by BLAT and the UCSC browser, and also in various hashing schemes.

2
Entering edit mode

BWT aligners don't actually compress the genome, at least BowTie doesn't. It just stores a 2 bit encoded BWT transform, plus a bunch of suffix array indices and other data. The FM index, however, does compress the BWT and it's associated data in blocks

1
Entering edit mode

BWA does use the FM-index but the BWT string which makes up the core of the FM-index is kept uncompressed (it does use a 2-bit encoding though).

0
Entering edit mode

Interesting! I had just assumed they all used compression as well, since the BWT gets them half-way there. In my own experiments, compressed indices were a lot slower than plain suffix arrays, though. On the other hand, BWA appears to use the FM index, and refers to it from their paper (http://bioinformatics.oxfordjournals.org/content/26/5/589.full).

0
Entering edit mode

My understanding is that the BWT of a genome is likely to contain many long runs of a single character, and so is much more easily compressed (e.g. by RLE) than the original genome. But the BWT itself does not reduce the storage size of a genome

0
Entering edit mode

Yes, I think that's correct. Bzip used arithmetic coding, and bzip2 used some other coding (RLE+Huffman?) due to patent issues.

0
Entering edit mode
11.3 years ago
Ben Lange ▴ 190

I would think:

Not being able to trust raw sequencing data is a real problem for compression. If there are errors, it can be difficult to know what you can drop if anything. It also forces software to deal with alignment issues, which can really degrade the big O notation of genetically related search algorithms. That really brings up the question, what will it take to get a significantly higher accuracy in sequence data?

Once you have high accuracy data, you should be able to map larger and larger sequences to a library of known global identifiers. I'm learning so that may be what everyone means by leveraging a reference genome.

Another perspective: Many people are moving towards "Cloud Storage". Storing lots of people's music has similar scaling issues. If cloud providers are forced into essentially storing each person's hard drive, they will get overwhelmed because they're responsible for everyone's unique bytes. But if they can match an entire song to a single global identifier and then have 100Ks people just reference that identifier, you can scale that much more efficiently.

0
Entering edit mode

I think what Birney & co mean, is that by assuming a shared, known reference genome, you only need to store and communicate the differences - think of it as a pre-shared LZ dictionary. Even if you store the ref along with the diffs, it can be a net win. The hard part is quality information, where you need to make some sacrifices.

0
Entering edit mode
11.3 years ago

Looks like GenCompress is well referenced.

I heard about it from the Scientific American article Chain Letters & Evolutionary Histories.

They use this compression algorithm to get the distance measure (which then can be used to reconstruct the phylogeny) between genomes, chain letters, and plagiarized schoolwork. The method is simple:

1. Take 2 genomes
2. Compress the genomes individually to get sizes A and B (in bytes)
3. Compress the concatenation of the original genomes to get size C (in bytes)
4. (A + B) / C is the distance measure