dna compression cod
5.8 years ago

Hi

Is there anyone help me about standard DNA compression algorithms cod such as CTW+LZ, bzip2, gzip2, etc.

Thank you.

dna compression • 1.8k views

This question requires a lot more information to get a good answer, please elaborate. It is unclear what you are asking for. In addition, if you show us the effort you have made to answer this we are more eager to help you and put you on the right track.


Hi Thanks I want to compair my algorithm by other compression algorithm.So i need cods of som standard algorithm ,becuse i have no time to cod them.


What's "cod"?


Preferably in Matlab.


I'm guessing the typos confused Wouter. Please proofread your posts to avoid such confusion.


Hi I want to compair my algorithm by other compression algorithm.So i need cods of som standard algorithm ,becuse i have no time to cod them.


I understand that English may not be your first language, but it's also not mine. Please spend some more time on spelling and grammar since you are making your posts hard to read.


Copy-pasting comments does not help your case. Please correct the typos by editing the original comments or, better, follow up with more detail and make sure everything is spelled correctly. Thank you!

5.8 years ago
h.mon 35k

There are plenty of open source implementations of compression algorithms, both general purpose and DNA-optimized. Here are some links:

General compression algorithms:

Code for gzip - https://ftp.gnu.org/gnu/gzip/

Code for bzip2 - http://www.bzip.org/downloads.html

DNA compression algorithms:

Code for LW-FQZip - http://csse.szu.edu.cn/staff/zhuzx/lwfqzip2/Download&Installation.html

Code for fqzcomp - https://sourceforge.net/projects/fqzcomp/files/?source=navbar

Good (but old, the literature has moved forward a lot since its publication) review on compression algorithms for sequencing data: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190
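If you only need baseline numbers to compare your own algorithm against, you may not have to build the C sources at all. Here is a minimal sketch, assuming Python is acceptable: its standard library wraps zlib (the DEFLATE algorithm that gzip uses), bzip2 and LZMA. The helper name `bits_per_base` and the random test sequence are my own placeholders; swap in your real sequences.

```python
import bz2
import lzma
import random
import zlib


def bits_per_base(seq, compress):
    """Compressed size in bits divided by the number of bases."""
    return 8 * len(compress(seq.encode("ascii"))) / len(seq)


# Placeholder input: uniformly random bases. Real genomes are more
# repetitive, so replace this with your own data before drawing conclusions.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000))

baselines = {
    "zlib (DEFLATE, as in gzip)": lambda b: zlib.compress(b, 9),
    "bz2 (bzip2)": lambda b: bz2.compress(b, 9),
    "lzma (xz)": lzma.compress,
}

for name, compress in baselines.items():
    print(f"{name}: {bits_per_base(seq, compress):.2f} bits/base")
```

Reporting bits per base makes the numbers comparable across inputs of different length. Uniformly random DNA carries 2 bits per base, so none of these should get meaningfully below 2 on this placeholder input.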


Thank you. Sorry, I can't speak English very well.


I would suggest you install the free Grammarly browser extension to help you get your English right. It's a great tool (I use it, too).

See https://www.grammarly.com/


Thanks for the suggestion.

5.8 years ago

If you're strictly talking about generic DNA and you're fishing for cod, you might look into the 2bit and nib formats: http://genome.ucsc.edu/FAQ/FAQformat.html#format7

Ignoring ambiguous bases, you can encode the four bases of DNA (A, C, G, T) with two binary bits like so:

A : 00
C : 01
G : 10
T : 11

Consider the sequence AACATTGT, encoded as a series of bits:

00 00 01 00 11 11 10 11

Four of these two-bit codes make a byte (eight bits). You could string together bits into bytes:

00000100 11111011

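Read in the order written, these two bytes are 04 FB in hexadecimal. If each byte is instead filled starting from its low-order bits (first base in the lowest two bits, a "little-endian" arrangement of the two-bit codes), the same sequence would be stored as 10 EF.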
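The packing described above can be sketched in a few lines of Python. This is only an illustration of the A=00, C=01, G=10, T=11 mapping from this post; the function names and the `first_base_high` switch are mine. Note that the UCSC .2bit format linked above assigns the codes differently (T=00, C=01, A=10, G=11) and has its own header, so this is not a .2bit writer.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i is the base whose code is i


def pack_2bit(seq, first_base_high=True):
    """Pack an A/C/G/T string into bytes, four bases per byte.

    A final partial byte is padded with zero bits, so the caller must
    remember the original length to unpack correctly.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            shift = 6 - 2 * j if first_base_high else 2 * j
            byte |= CODES[base] << shift
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length, first_base_high=True):
    """Inverse of pack_2bit; `length` is the number of bases to return."""
    bases = []
    for byte in data:
        for j in range(4):
            shift = 6 - 2 * j if first_base_high else 2 * j
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


print(pack_2bit("AACATTGT").hex())                         # 04fb
print(pack_2bit("AACATTGT", first_base_high=False).hex())  # 10ef
print(unpack_2bit(pack_2bit("AACATTG"), 7))                # AACATTG
```

Putting the first base in the high-order bits reproduces the bit strings as written (00000100 11111011); putting it in the low-order bits gives the 10 EF form.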

If you have a long sequence, you can pack these two-bit codes into a long stream of bytes and compress that stream further with a general-purpose compressor. Such compressors generally look for repeated byte patterns and replace each repeat with a short reference to an earlier occurrence or to a dictionary entry, as gzip does.
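As a rough sketch of that two-stage idea, the snippet below packs a toy sequence at two bits per base and then runs zlib (the DEFLATE algorithm gzip uses) over the packed bytes. The toy "genome" (random background plus one repeated motif) is an assumption of mine, chosen so that there is something for the second stage to find.

```python
import random
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq):
    """Pack four bases per byte, first base in the high-order bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= CODES[base] << (6 - 2 * j)
        out.append(byte)
    return bytes(out)


random.seed(1)
motif = "".join(random.choice("ACGT") for _ in range(200))
background = "".join(random.choice("ACGT") for _ in range(20_000))
# Toy input: 20,000 random bases followed by a 200-base motif repeated 50 times
seq = background + motif * 50

ascii_size = len(zlib.compress(seq.encode("ascii"), 9))
packed = pack_2bit(seq)
packed_size = len(zlib.compress(packed, 9))

print(len(seq), "bases as ASCII text")
print("ASCII + zlib: ", ascii_size, "bytes")
print("2-bit packed: ", len(packed), "bytes")
print("2-bit + zlib: ", packed_size, "bytes")
```

One caveat: after packing, a repeat only shows up as a repeated byte pattern if both copies start at the same position within a byte, that is, if they are a multiple of four bases apart. The motif here is 200 bases long and starts on a byte boundary, so every copy lines up; real repeats often will not.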

Sequences in nature have repeats or motifs that effectively are like this kind of dictionary approach, and so help with squeezing redundancy out of a genome. Bases in motifs tend to have wiggle room though, so it's not perfect. But every bit counts.

A related approach is reference-based compression, which stores only the differences between the sequences and a reference genome: https://genome.cshlp.org/content/21/5/734.full

This is very efficient, so long as the space cost of storing the reference genome is ignored.
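To make the idea concrete, here is a toy sketch that stores only substitutions against a reference of the same length. It is not the method from the linked paper, which works on aligned reads and handles much more (insertions, deletions, unaligned reads, quality scores); the function names are mine.

```python
def diff_against_reference(reference, sample):
    """Return (position, base) pairs where sample differs from reference.

    Toy assumption: both sequences have the same length, so only
    substitutions need to be recorded.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]


def apply_diff(reference, diff):
    """Rebuild the sample from the reference plus the stored differences."""
    bases = list(reference)
    for pos, base in diff:
        bases[pos] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

diff = diff_against_reference(reference, sample)
print(diff)                                   # [(4, 'T'), (9, 'A')]
print(apply_diff(reference, diff) == sample)  # True
```

The stored diff is tiny when the sample is close to the reference, but decompression is impossible without the exact same reference, which is the hidden cost mentioned above.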

A good overview of general compression approaches is available here: http://mattmahoney.net/dc/dce.html

I think cmix is among the best compression tools out there, using training methods to predict content, but it gains compression efficiency at the cost of time and memory: https://github.com/byronknoll/cmix

More benchmarks on text compression available here: http://mattmahoney.net/dc/text.html

Compression ratio is not everything if you have to spend a lot of time compressing and extracting. Production tools generally aim for a good balance between compression ratio and processing time.
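When you benchmark your own algorithm, it is worth recording time alongside size. A minimal sketch, assuming Python's zlib module and a random placeholder sequence (use your real data instead); timings will of course vary by machine.

```python
import random
import time
import zlib

# Placeholder input: 500,000 uniformly random bases as ASCII bytes
random.seed(2)
data = "".join(random.choice("ACGT") for _ in range(500_000)).encode("ascii")

results = {}
for level in (1, 6, 9):  # fastest, default, and strongest zlib levels
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    results[level] = (len(compressed), elapsed)
    print(f"level {level}: {len(compressed)} bytes in {elapsed * 1000:.1f} ms")
```

The same loop works for any compressor you can call as a function, so you can drop your own algorithm in next to the baselines and report both columns.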

5.8 years ago
GenoMax 141k

Several tools built specifically for sequence compression are mentioned in this thread, which you can take a look at: uQ - small binary FASTQ
