Question: How To Estimate The Complexity Of A Dna String
4
8.5 years ago by
Anima Mundi2.8k
Italy
Anima Mundi2.8k wrote:

Hello, I would like to add to a Python script a quick-and-dirty estimate of the complexity of the DNA sequence the script already takes as input. One solution would be to calculate the compression ratio, but I do not know how to calculate it: do you? Any other approach is also welcome.

python • 9.2k views
modified 8.5 years ago by JC12k • written 8.5 years ago by Anima Mundi2.8k
1

what do you mean by "complexity of the DNA sequence" ?

I mean a measure of the amount of computational effort needed to specify the DNA string.

9
8.5 years ago by
JC12k
Mexico
JC12k wrote:

Some time ago I played with different formulas to compute the composition and complexity of a DNA sequence. I compiled a series of routines for GC content, GC skew, AT skew and CpG density (composition), as well as linguistic complexity, Markov chains, Wootton & Federhen complexity, entropy, Trifonov's complexity and, of course, compression using zlib (complexity).

If you know some Perl you can take a look: http://caballero.github.com/SeqComplex/

Wow, bookmarked ;).

I'm very glad to have found this script!  I have one quick follow-up question: could you explain what the difference between entropy (ce) and Markov first order complexity (cm1) is?  My understanding was that they were measuring the same thing, but I'm getting slightly different values for the two measures.

6
8.5 years ago by
Leonor Palmeira3.7k
Liège, Belgium
Leonor Palmeira3.7k wrote:

If you are looking to determine how much your DNA could be compressed, then you might want to search PubMed for "DNA data compression".

There are indeed different algorithms for compressing DNA, and therefore different compression ratios for the same dataset depending on the compression algorithm used, where:

Compression ratio = compressed size / uncompressed size

5
8.5 years ago by
Pierre Lindenbaum131k
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Have a look at the algorithm used by NCBI DustMasker:

``````DustMasker is a program that identifies and masks out low complexity
parts of a genome using a new and improved DUST algorithm. The main
advantages of the new algorithm are symmetry with respect to taking
reverse complements, context insensitivity, and much better performance.
``````
4
8.5 years ago by
xenophiliuslovegood170 wrote:

Possibly you could use the zlib module of the Python standard library to compress the DNA sequence you are considering. The compression ratio obtained for different sequences would allow you to rank them according to their computational complexity. (I don't know the first thing about complexity, just following your hint of using the compression ratio as a proxy for it).

Say, something like the following example, where I measure compression for three sequences (an extremely repetitive one, one I just invented, and one that is randomly generated):

``````import random
import zlib

def compression_ratio(seq):
    # Compressed size divided by uncompressed size.
    # zlib.compress expects bytes, so encode the string first.
    data = seq.encode("ascii")
    return len(zlib.compress(data)) / len(data)

# An extremely repetitive sequence
print(compression_ratio("AAAAAAAAAAAAAAAA"))  # 0.6875

# A sequence I just invented
print(compression_ratio("ACTGTACGTCCGTG"))  # 1.42857142857

# A randomly generated sequence
s = "".join(random.choice("ACGT") for _ in range(1000))
print(compression_ratio(s))  # ~0.37 (for example)
``````

what do you think?

1

It works well. It fails in some cases, but it fits my "quick and dirty" scope perfectly.