Hello, I would like to add to a Python script an estimation (quick and dirty) of the complexity of the DNA sequence already taken as input by the script itself. A solution would be to calculate the compression ratio, but I do not know how to calculate it: do you? Any other approach is also welcome.
Time ago I played with different formulas to compute the composition and complexity of a DNA sequence. I recopiled a series of routines for GC content, GC-skew, AT-skew, CpG density (composition), and linguistic complexity, Markov chains, Wootton & Federhen complexity, entropy, Trifonov's complexity and of course compression using Zlib (complexity).
If you know some Perl you can take a look: http://caballero.github.com/SeqComplex/
If you are looking to determine how much your DNA could be compressed, then you might want to search PubMed for "DNA data compression".
There are indeed different algorithms to compress DNA information -- therefore also different compression ratios for the same dataset according to the compression algorithm used, as :
Compression ratio = compressed size / uncompressed size
have a look at the algorithm used by NCBI dustmasker
DustMasker is a program that identifies and masks out low complexity parts of a genome using a new and improved DUST algorithm. The main advantages of the new algorithm are symmetry with respect to taking reverse complements, context insensitivity, and much better performance.
Possibly you could use the zlib module of the Python standard library to compress the DNA sequence you are considering. The compression ratio obtained for different sequences would allow you to rank them according to their computational complexity. (I don't know the first thing about complexity, just following your hint of using the compression ratio as a proxy for it).
Say, something like the following example where I measure compression for 3 sequences (an extremely repetitive one, one I just invented, and one which is randomly generated)
import random import zlib # s="AAAAAAAAAAAAAAAA" #0.6875 print float(len(zlib.compress(s)))/len(s) #1.42857142857 s="ACTGTACGTCCGTG" print float(len(zlib.compress(s)))/len(s) #0.366366366366 (for example) s="".join([random.choice(["A","C","G","T"]) for e in range(1,1000)]) print float(len(zlib.compress(s)))/len(s)
what do you think?