Question: How To Estimate The Complexity Of A Dna String
4
gravatar for Anima Mundi
8.5 years ago by
Anima Mundi2.8k
Italy
Anima Mundi2.8k wrote:

Hello, I would like to add to a Python script an estimation (quick and dirty) of the complexity of the DNA sequence already taken as input by the script itself. A solution would be to calculate the compression ratio, but I do not know how to calculate it: do you? Any other approach is also welcome.

python • 9.2k views
ADD COMMENTlink modified 8.5 years ago by JC12k • written 8.5 years ago by Anima Mundi2.8k
1

what do you mean by "complexity of the DNA sequence" ?

ADD REPLYlink written 8.5 years ago by Nicolas Rosewick9.2k

I mean a measure of the amount of computational effort needed to specify the DNA string.

ADD REPLYlink written 8.5 years ago by Anima Mundi2.8k
9
gravatar for JC
8.5 years ago by
JC12k
Mexico
JC12k wrote:

Time ago I played with different formulas to compute the composition and complexity of a DNA sequence. I recopiled a series of routines for GC content, GC-skew, AT-skew, CpG density (composition), and linguistic complexity, Markov chains, Wootton & Federhen complexity, entropy, Trifonov's complexity and of course compression using Zlib (complexity).

If you know some Perl you can take a look: http://caballero.github.com/SeqComplex/

ADD COMMENTlink written 8.5 years ago by JC12k

Wow, bookmarked ;).

ADD REPLYlink written 8.5 years ago by Anima Mundi2.8k

I'm very glad to have found this script!  I have one quick follow-up question: could you explain what the difference between entropy (ce) and Markov first order complexity (cm1) is?  My understanding was that they were measuring the same thing, but I'm getting slightly different values for the two measures.

ADD REPLYlink written 5.2 years ago by clairerw0
6
gravatar for Leonor Palmeira
8.5 years ago by
Leonor Palmeira3.7k
Liège, Belgium
Leonor Palmeira3.7k wrote:

If you are looking to determine how much your DNA could be compressed, then you might want to search PubMed for "DNA data compression".

There are indeed different algorithms to compress DNA information -- therefore also different compression ratios for the same dataset according to the compression algorithm used, as :

Compression ratio = compressed size / uncompressed size

ADD COMMENTlink written 8.5 years ago by Leonor Palmeira3.7k
5
gravatar for Pierre Lindenbaum
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

have a look at the algorithm used by NCBI dustmasker

DustMasker is a program that identifies and masks out low complexity
parts of a genome using a new and improved DUST algorithm. The main
advantages of the new algorithm are symmetry with respect to taking
reverse complements, context insensitivity, and much better performance.
ADD COMMENTlink modified 8.5 years ago by Istvan Albert ♦♦ 85k • written 8.5 years ago by Pierre Lindenbaum131k
4
gravatar for xenophiliuslovegood
8.5 years ago by
xenophiliuslovegood170 wrote:

Possibly you could use the zlib module of the Python standard library to compress the DNA sequence you are considering. The compression ratio obtained for different sequences would allow you to rank them according to their computational complexity. (I don't know the first thing about complexity, just following your hint of using the compression ratio as a proxy for it).

Say, something like the following example where I measure compression for 3 sequences (an extremely repetitive one, one I just invented, and one which is randomly generated)

import random
import zlib
#
s="AAAAAAAAAAAAAAAA"
#0.6875
print float(len(zlib.compress(s)))/len(s)
#1.42857142857
s="ACTGTACGTCCGTG"
print float(len(zlib.compress(s)))/len(s)
#0.366366366366 (for example)
s="".join([random.choice(["A","C","G","T"]) for e in range(1,1000)])
print float(len(zlib.compress(s)))/len(s)

what do you think?

ADD COMMENTlink written 8.5 years ago by xenophiliuslovegood170
1

It works good. In some cases it fails but fits perfectly my "quick and dirty" scope.

ADD REPLYlink written 8.5 years ago by Anima Mundi2.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2009 users visited in the last hour