Question: Shannon entropy of a DNA motif?
2
a1ultima710 wrote:

I have DNA motifs represented by position-weight-matrices (PWMs) a.k.a position-specific scoring matrices (PSSMs), in transfac form (motif names are shown in rows following "DE", numbered rows represent the observed frequencies of letters along the sequence such that row 0 is the first letter along the sequence, the last column shows the most representative letter for that position along the sequence and these can be given ambiguity codes (not A,G,C,T) if no particular letter is representative:

DE    SRF
0    0.0435    0.0217    0.8478    0.0870    G
1    0.1957    0.7174    0.0435    0.0435    C
2    0.0000    0.9782    0.0217    0.0000    C
3    0.0217    0.9782    0.0000    0.0000    C
4    0.6956    0.0217    0.0000    0.2826    A
5    0.0652    0.0217    0.0000    0.9130    T
6    1.0000    0.0000    0.0000    0.0000    A
7    0.0217    0.0000    0.0000    0.9782    T
8    0.9348    0.0000    0.0000    0.0652    A
9    0.3261    0.0217    0.0000    0.6522    T
10    0.0435    0.0000    0.9565    0.0000    G
11    0.0435    0.0217    0.9348    0.0000    G
XX
DE    HMG-1
0    0.0000    0.3846    0.6154    0.0000    G
1    0.0000    0.0000    0.2308    0.7692    T
2    0.0000    0.3077    0.0000    0.6923    T
3    0.0000    0.1539    0.7692    0.0769    G
4    0.0000    0.0769    0.0000    0.9230    T
5    0.4615    0.0769    0.2308    0.2308    N
6    0.2308    0.3846    0.0000    0.3846    N
7    0.0000    0.0769    0.1539    0.7692    T
8    0.0000    0.6154    0.0769    0.3077    C
XX

I would like to calculate the Shannon Entropy for each motif, could anybody advise me on how to do this? I am most comfortable with Python, so if you have a few lines of code available or can point me to some package that would let me do this, or even provide me with a formula that would be much appreciated.

modified 5.4 years ago • written 5.4 years ago by a1ultima710
6
a1ultima710 wrote:

Thanks to advice and code from here and the theory and formula from here: , where is the relative frequency of base or amino acid at position I've adapted a Python script to answer my own question, hope this helps others in future:

import numpy as np

def compute_entropy(motif):     arr = np.array(motif)     H = (arr.flatten() * np.log2(arr.flatten())).sum(axis=0) # Formula from: http://en.wikipedia.org/wiki/Sequence_logo     print 'entropy:', -H.mean(), 'bits'

motif_path  = '../data/stamp_data/out/dreme_100bp_e0.05/SWU_SSD/e005FBP.txt' fi          = open(motif_path,'r')  motifs      = [] motif       = [] while True:     line = fi.readline().strip().lower()     if line == "":         break     elif line.startswith('de'):         header = line         print header     elif line == 'xx': # when it's ended process the cached motif         if motif:             compute_entropy(motif)         motifs.append(motif)         motif = []     else: # append motif rows if the line is not a header, nor a delimiter, nor a break         motif.append(map(float,line.split()[1:-1])) fi.close()

The mean of the entropy at each base is absolutely not the entropy of a motif. You're saying for example, a single BP with p=1/4 for atcg, has the same randomness as a ten BP sequence of such p=1/4's, when clearly the longer sequence contains more mathematical information.  You asked a programming forum and copied the code from someone who added a disclaimer that he doesn't know what a biologists' entropy is.

@karl.stamm, Firstly I did not copy the code it's quite different, secondly I did check the StackOverflow post's formula against a source Here and it completely agrees with the source, quoting the source: " , where is the relative frequency of base or amino acid at position " ...The person who posted from StackOverflow made one assumption, which is coincidentally the pivotal assumption that profile hidden-markov model-based DNA profile alignment models make.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by a1ultima710

Everything has an i subscript. This formula is correct as a per-base entropy, but that's irrelevant for the motif. Consider a motif of fifty As and one 25%-anything base. There's clearly only four possibilities, all equally likely, but the mean of 51 entropies will be nearly zero. Being a uniform distribution, this should have the highest entropy, and H=2 bits at the 25%-anything base, but do you mean for it to be averaged with the constant bases? Under my model of using all possible sequences, the constant bases don't impact the motif entropy, but under your model, fifty As will drown out the signal.  I guess it depends on your requirements, but I need to insist that a mean of H sub i is not the entropy of a motif.

You sound quite convincing, which makes me feel you're onto something, but I can't quite understand what you're saying... which probably means I've some more reading to do, so cheers.

1
pcantalupo120 wrote:

Hello,

I just found this interesting website that calculates Shanon entropy http://www.shannonentropy.netmark.pl/ and gives you the formula.

Hope it helps,

Paul

Hi @Paul, i'm trying to constrain my strings to just 4 possible letters (A,G,T,C) the decimals in the matrix dictate the relative frequencies of each of these letters e.g. for HMG-1 the first position has a 0.384 chance of being a C and 0.6154 chance of being a G. The online calculator unfortunately does not care about that information...

1
karl.stamm3.5k wrote:

The entropy formula is negative the sum of p log p, summed across all p; where p is the probability of each possible outcome (and traditionally log base 2). For a coin flip, a fair coin has two possibilities at p=0.5, so you get entropy=1.66

Sorry this code is in R, not python:

-(0.5*log(0.5,2) + 0.5*log(0.2,2))

If the coin was skewed so that tails comes up 99% of the time, it should be intuitive that there is less randomness in the sequence of flips, entropy=0.08

-(0.99*log(0.99,2) + 0.01*log(0.01,2))

So what you need is to enumerate all possible sequences, which I don't exactly see from your input data. That's a matrix of some kind? It looks like you have probabilities of ATCG at each position, so this is a set of sequences, for DE-SRF you have 4^11 possible sequences to sum across, and for DE-HMG1 you have 4^8 sequences to sum over. For each possible sequence you can calculate the p by multiplying the choices.

So for DE-HMG you have the possible sequence AAAAAAAA with probability 0*0*0*0*0*0.46*0.23*0*0 which comes to zero, so the first p term drops out. But DE-HMG has a potential sequence CGCCCCCC with probability 0.38*0.23*0.30*0.15*0.07*0.07*0.38*0.07*0.61.  So you better write an algorithm that walks across these choices. Something with recursion might be able to skip all those zero-cases more efficiently, because enumerating the 4^11 choices for DE-SRF might take a while.

Then of course beware that PCs have trouble summing a ton of tiny tiny numbers, so you might run into numerical stability issues. (for example did you know that 1+10^-50 is exactly equal to one?)

Hi @karl.stamm

Your last three paragraphs were hard to follow, but regards to the formula, how do I choose the base? is it equal to the number of possible outcomes (e.g. A, G, T, C are four possible outcomes => base = 4?).

i'm trying to constrain my strings to just 4 possible letters (A,G,T,C) the decimals in the matrix dictate the relative frequencies of each of these letters e.g. for HMG-1 the first position has a 0.384 chance of being a C and 0.6154 chance of being a G. The online calculator unfortunately does not care about that information...