How To Estimate The Complexity Of A DNA String
4
6
Entering edit mode
12.2 years ago
Anima Mundi ★ 2.9k

Hello, I would like to add a quick-and-dirty estimate of the complexity of a DNA sequence to a Python script (the sequence is already taken as input by the script itself). One solution would be to calculate the compression ratio, but I do not know how to calculate it: do you? Any other approach is also welcome.

python • 12k views
1
Entering edit mode

what do you mean by "complexity of the DNA sequence" ?

0
Entering edit mode

I mean a measure of the amount of computational effort needed to specify the DNA string.

13
Entering edit mode
12.2 years ago
JC 13k

Some time ago I played with different formulas to compute the composition and complexity of a DNA sequence. I compiled a series of routines for GC content, GC-skew, AT-skew and CpG density (composition), plus linguistic complexity, Markov chains, Wootton & Federhen complexity, entropy, Trifonov's complexity and, of course, compression using Zlib (complexity).

If you know some Perl you can take a look: http://caballero.github.com/SeqComplex/
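As a rough illustration of the entropy measure in that list (my own sketch in Python, not a routine from SeqComplex), Shannon entropy over single-base frequencies gives 0 bits per base for a homopolymer and the maximum of 2 bits for a balanced sequence over {A,C,G,T}:

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy of a DNA string, in bits per base."""
    counts = Counter(seq)
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# A homopolymer carries no information per base; a perfectly
# balanced sequence over 4 symbols reaches the maximum of 2 bits.
print(shannon_entropy("AAAAAAAA"))      # 0.0
print(shannon_entropy("ACGTACGTACGT"))  # 2.0
```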

0
Entering edit mode

Wow, bookmarked ;).

0
Entering edit mode

I'm very glad to have found this script! I have one quick follow-up question: could you explain what the difference between entropy (ce) and Markov first order complexity (cm1) is? My understanding was that they were measuring the same thing, but I'm getting slightly different values for the two measures.
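Conceptually, plain entropy looks only at single-base frequencies, while a first-order Markov measure conditions each base on the previous one, so the two differ whenever adjacent bases are correlated. A minimal sketch of the conditional-entropy idea (an illustration only, not the actual ce/cm1 routines from SeqComplex):

```python
import math
from collections import Counter

def markov1_entropy(seq):
    """Average conditional entropy H(next base | current base), in bits.
    Differs from zeroth-order Shannon entropy whenever adjacent
    bases are correlated."""
    pairs = Counter(zip(seq, seq[1:]))      # counts of adjacent base pairs
    singles = Counter(seq[:-1])             # counts of "current" bases
    n = len(seq) - 1
    h = 0.0
    for (a, b), c in pairs.items():
        p_pair = c / n                      # P(current=a, next=b)
        p_cond = c / singles[a]             # P(next=b | current=a)
        h -= p_pair * math.log2(p_cond)
    return h

# "ACACACAC" has zeroth-order entropy 1.0 bit/base, but each base
# fully determines the next, so its first-order measure is 0.
print(markov1_entropy("ACACACAC"))  # 0.0
```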

6
Entering edit mode
12.2 years ago

If you are looking to determine how much your DNA could be compressed, then you might want to search PubMed for "DNA data compression".

There are indeed different algorithms for compressing DNA information, and therefore different compression ratios for the same dataset depending on the compression algorithm used:

Compression ratio = compressed size / uncompressed size

6
Entering edit mode
12.2 years ago

Possibly you could use the zlib module of the Python standard library to compress the DNA sequence you are considering. The compression ratio obtained for different sequences would allow you to rank them according to their computational complexity. (I don't know the first thing about complexity, just following your hint of using the compression ratio as a proxy for it).

Say, something like the following example, where I measure compression for 3 sequences (an extremely repetitive one, one I just invented, and one that is randomly generated):

import random
import zlib

# An extremely repetitive sequence
s = b"AAAAAAAAAAAAAAAA"
print(len(zlib.compress(s)) / len(s))  # 0.6875

# A short sequence I just invented
s = b"ACTGTACGTCCGTG"
print(len(zlib.compress(s)) / len(s))  # 1.42857142857

# A randomly generated sequence
s = "".join(random.choice("ACGT") for _ in range(999)).encode()
print(len(zlib.compress(s)) / len(s))  # ~0.37 (varies run to run)


what do you think?

1
Entering edit mode

It works well. In some cases it fails, but it fits my "quick and dirty" scope perfectly.

5
Entering edit mode
12.2 years ago

have a look at the algorithm used by NCBI dustmasker

DustMasker is a program that identifies and masks out low complexity
parts of a genome using a new and improved DUST algorithm. The main
advantages of the new algorithm are symmetry with respect to taking
reverse complements, context insensitivity, and much better performance.