Question: How To Estimate The Complexity Of A Dna String
4
8.5 years ago by
Anima Mundi2.8k
Italy
Anima Mundi2.8k wrote:

Hello, I would like to add to a Python script a quick-and-dirty estimate of the complexity of the DNA sequence the script already takes as input. One solution would be to calculate the compression ratio, but I do not know how to calculate it: do you? Any other approach is also welcome.

python • 9.2k views
modified 8.5 years ago by JC12k • written 8.5 years ago by Anima Mundi2.8k
1

what do you mean by "complexity of the DNA sequence" ?

I mean a measure of the amount of computational effort needed to specify the DNA string.

9
8.5 years ago by
JC12k
Mexico
JC12k wrote:

Some time ago I played with different formulas to compute the composition and complexity of a DNA sequence. I compiled a series of routines for GC content, GC skew, AT skew and CpG density (composition), as well as linguistic complexity, Markov chains, Wootton & Federhen complexity, entropy, Trifonov's complexity and, of course, compression using zlib (complexity).

If you know some Perl you can take a look: http://caballero.github.com/SeqComplex/

Wow, bookmarked ;).

I'm very glad to have found this script!  I have one quick follow-up question: could you explain what the difference between entropy (ce) and Markov first order complexity (cm1) is?  My understanding was that they were measuring the same thing, but I'm getting slightly different values for the two measures.

6
8.5 years ago by
Leonor Palmeira3.7k
Liège, Belgium
Leonor Palmeira3.7k wrote:

If you are looking to determine how much your DNA could be compressed, then you might want to search PubMed for "DNA data compression".

There are indeed different algorithms for compressing DNA, and therefore different compression ratios for the same dataset depending on the compression algorithm used, where:

Compression ratio = compressed size / uncompressed size

5
8.5 years ago by
Pierre Lindenbaum131k
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Have a look at the algorithm used by NCBI DustMasker:

``````DustMasker is a program that identifies and masks out low complexity
parts of a genome using a new and improved DUST algorithm. The main
advantages of the new algorithm are symmetry with respect to taking
reverse complements, context insensitivity, and much better performance.
``````
4
8.5 years ago by
xenophiliuslovegood170 wrote:

Possibly you could use the zlib module of the Python standard library to compress the DNA sequence you are considering. The compression ratio obtained for different sequences would allow you to rank them according to their computational complexity. (I don't know the first thing about complexity, just following your hint of using the compression ratio as a proxy for it).

Say, something like the following example, where I measure compression for three sequences (an extremely repetitive one, one I just invented, and one that is randomly generated):

``````import random
import zlib

def compression_ratio(seq):
    # Compressed size divided by uncompressed size.
    # zlib.compress expects bytes, so encode the string first.
    data = seq.encode("ascii")
    return len(zlib.compress(data)) / len(data)

# An extremely repetitive sequence
print(compression_ratio("AAAAAAAAAAAAAAAA"))  # 0.6875

# A sequence I just invented
print(compression_ratio("ACTGTACGTCCGTG"))  # 1.42857142857

# A randomly generated sequence
s = "".join(random.choice("ACGT") for _ in range(1000))
print(compression_ratio(s))  # ~0.37 (for example)
``````

what do you think?

1

It works well. It fails in some cases, but it fits my "quick and dirty" scope perfectly.