Hi
Can anyone help me find code for standard DNA compression algorithms such as CTW+LZ, bzip2, gzip, etc.?
Thank you.
There are plenty of open source implementations of compression algorithms, both general purpose and DNA-optimized. Here are some links:
General compression algorithms:
Code for gzip - https://ftp.gnu.org/gnu/gzip/
Code for bzip2 - http://www.bzip.org/downloads.html
DNA compression algorithms:
Code for LW-FQZip - http://csse.szu.edu.cn/staff/zhuzx/lwfqzip2/Download&Installation.html
Code for fqzcomp - https://sourceforge.net/projects/fqzcomp/files/?source=navbar
Good (but old, the literature has moved forward a lot since its publication) review on compression algorithms for sequencing data: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190
If you're strictly talking about generic DNA and you're fishing for cod, you might look into the 2bit and nib formats: http://genome.ucsc.edu/FAQ/FAQformat.html#format7
Ignoring ambiguous bases, you can encode the four bases of DNA (A, C, G, T) with two binary bits like so:
A : 00
C : 01
G : 10
T : 11
Consider the sequence AACATTGT, encoded as a series of bits:
00 00 01 00 11 11 10 11
Four of these two-bit codes make a byte (eight bits). You could string together bits into bytes:
00000100 11111011
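In hexadecimal, these two bytes are 04 FB. If you instead pack the first base of each group of four into the lowest-order bits of its byte (a low-first or "little-endian" packing), the same sequence would be written as 10 EF, for example.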
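A minimal Python sketch of the packing described above, using the A/C/G/T codes from the table. The function name `pack_2bit` and the `first_base_low` flag are made up for illustration, and the sketch assumes the sequence length is a multiple of four and contains no ambiguous bases. It does not follow the exact layout of the UCSC 2bit format.

```python
# Two-bit codes from the table above; ambiguous bases (e.g. N) are not handled.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq, first_base_low=False):
    """Pack a DNA string into bytes, four bases per byte.

    Assumes len(seq) is a multiple of 4; a real format would also store
    the sequence length so that a partial final byte can be decoded.
    """
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            # Either the first base goes in the top two bits (shift 6, 4, 2, 0)
            # or in the bottom two bits (shift 0, 2, 4, 6).
            shift = 2 * j if first_base_low else 6 - 2 * j
            byte |= CODES[base] << shift
        out.append(byte)
    return bytes(out)


print(pack_2bit("AACATTGT").hex(" "))                       # 04 fb
print(pack_2bit("AACATTGT", first_base_low=True).hex(" "))  # 10 ef
```

Either bit order works as long as the encoder and decoder agree on it.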
If you have a long sequence, you can take these encoded bits, make a long stream of bytes from them, and compress them further with other compression algorithms, which generally look for repeats of byte patterns, storing those repeats in a table or dictionary, as is done with gzip.
Sequences in nature have repeats or motifs that effectively are like this kind of dictionary approach, and so help with squeezing redundancy out of a genome. Bases in motifs tend to have wiggle room though, so it's not perfect. But every bit counts.
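To make the "pack, then compress" idea concrete, here is a rough Python sketch that feeds both the plain ASCII sequence and the 2-bit packed bytes to three general-purpose compressors from the standard library (`zlib` implements DEFLATE, the algorithm behind gzip). The test sequence is synthetic and the motif frequency is an arbitrary assumption, so the printed sizes only illustrate the workflow and say nothing about real genomes.

```python
import bz2
import lzma
import random
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq):
    """Pack A/C/G/T into bytes, four bases per byte, first base in the high bits."""
    # Pad with 'A' to a whole number of bytes (an assumption of this sketch;
    # a real format would record the true length separately).
    padded = seq + "A" * (-len(seq) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for base in padded[i:i + 4]:
            byte = (byte << 2) | CODES[base]
        out.append(byte)
    return bytes(out)


def random_chunk(n):
    return "".join(random.choice("ACGT") for _ in range(n))


# Synthetic sequence: random background with one 200-base motif repeated in it.
random.seed(0)
motif = random_chunk(200)
seq = "".join(motif if random.random() < 0.3 else random_chunk(200) for _ in range(500))

text = seq.encode("ascii")
packed = pack_2bit(seq)

compressors = (("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress))
for name, data in (("ASCII text", text), ("2-bit packed", packed)):
    print(f"{name}: {len(data)} bytes before compression")
    for label, compress in compressors:
        print(f"  {label}: {len(compress(data))} bytes")
```

Note that the motif here is byte-aligned after packing. In real data a repeat can start at any of four offsets within a byte, which makes it harder for byte-oriented compressors to spot.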
A related approach is reference-based compression, which stores only the differences between a sequence and a reference genome: https://genome.cshlp.org/content/21/5/734.full
This is very efficient, so long as the space cost of storing the reference genome is ignored.
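As a toy illustration of the idea (not the method from the linked paper), the sketch below stores a sample as a list of substitutions against a reference and rebuilds it from them. It assumes both sequences have the same length, so insertions and deletions are not handled; real tools work from alignments. The function names are made up for this example.

```python
def diff_against_reference(reference, sample):
    """Return (position, base) pairs where sample differs from reference.

    Assumes equal-length sequences, i.e. substitutions only.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference, diff):
    """Rebuild the sample from the reference plus the stored substitutions."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "AACATTGTAACATTGT"
sample = "AACATTGTAAGATTGT"

diff = diff_against_reference(reference, sample)
print(diff)  # [(10, 'G')]
assert apply_diff(reference, diff) == sample
```

Only the diff needs to be stored per sample, but anyone decoding it must have exactly the same reference.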
A good overview of general compression approaches is available here: http://mattmahoney.net/dc/dce.html
I think cmix is among the best compression tools out there, using training methods to predict content, but it gains compression efficiency at the cost of time and memory: https://github.com/byronknoll/cmix
More benchmarks on text compression available here: http://mattmahoney.net/dc/text.html
Compression efficiency is not everything, if you have to spend a lot of time in compression and extraction. Production tools generally aim for some optimized combination of compression efficiency and processing time.
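If the goal is to compare your own algorithm against standard ones, a small harness along these lines measures both compression ratio and time. This is a sketch under assumptions: the `benchmark` function and `CODECS` table are names invented here, the input is a placeholder, and only compressors from the Python standard library are included. DNA-specific tools would have to be run as external programs.

```python
import bz2
import lzma
import time
import zlib


def benchmark(data, codecs):
    """Return a list of (name, ratio, compress_seconds, decompress_seconds)."""
    results = []
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        # A codec that does not round-trip is not worth timing.
        if restored != data:
            raise RuntimeError(f"{name} did not round-trip")
        results.append((name, len(data) / len(packed), t1 - t0, t2 - t1))
    return results


CODECS = {
    "zlib (DEFLATE, as in gzip)": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
    # "mine": (my_compress, my_decompress),  # hypothetical: plug in your own codec
}

data = b"AACATTGT" * 10_000  # placeholder input; substitute real sequence bytes
for name, ratio, c_time, d_time in benchmark(data, CODECS):
    print(f"{name}: ratio {ratio:.1f}, compress {c_time:.4f}s, decompress {d_time:.4f}s")
```

Timings from a single run are noisy, so repeat the measurement several times on realistic input before drawing conclusions.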
There are several sequence compression specific tools mentioned in this thread that you can take a look at: uQ - small binary FASTQ
This question requires a lot more information to get a good answer, please elaborate. It is unclear what you are asking for. In addition, if you show us the effort you have made to answer this we are more eager to help you and put you on the right track.
Hi Thanks I want to compair my algorithm by other compression algorithm.So i need cods of som standard algorithm ,becuse i have no time to cod them.
What's "cod"?
Matlab code would be preferable.
I'm guessing the typos confused Wouter. Please proofread your posts to avoid such confusion.
Hi I want to compair my algorithm by other compression algorithm.So i need cods of som standard algorithm ,becuse i have no time to cod them.
I understand that English may not be your first language, but it's also not mine. Please spend some more time on spelling and grammar since you are making your posts hard to read.
Copy-pasting comments does not help your case. Please correct the typos by editing the original comments, or better, follow up with more detail and ensure things are spelled fine. Thank you!