Question

Shannon entropy of a DNA motif?

3

Entering edit mode

10.0 years ago

a1ultima ▴ 840

I have DNA motifs represented by position-weight-matrices (PWMs) a.k.a position-specific scoring matrices (PSSMs), in transfac form (motif names are shown in rows following "DE", numbered rows represent the observed frequencies of letters along the sequence such that row 0 is the first letter along the sequence, the last column shows the most representative letter for that position along the sequence and these can be given ambiguity codes (not A,G,C,T) if no particular letter is representative:

DE    SRF
0    0.0435    0.0217    0.8478    0.0870    G
1    0.1957    0.7174    0.0435    0.0435    C
2    0.0000    0.9782    0.0217    0.0000    C
3    0.0217    0.9782    0.0000    0.0000    C
4    0.6956    0.0217    0.0000    0.2826    A
5    0.0652    0.0217    0.0000    0.9130    T
6    1.0000    0.0000    0.0000    0.0000    A
7    0.0217    0.0000    0.0000    0.9782    T
8    0.9348    0.0000    0.0000    0.0652    A
9    0.3261    0.0217    0.0000    0.6522    T
10    0.0435    0.0000    0.9565    0.0000    G
11    0.0435    0.0217    0.9348    0.0000    G
XX
DE    HMG-1
0    0.0000    0.3846    0.6154    0.0000    G
1    0.0000    0.0000    0.2308    0.7692    T
2    0.0000    0.3077    0.0000    0.6923    T
3    0.0000    0.1539    0.7692    0.0769    G
4    0.0000    0.0769    0.0000    0.9230    T
5    0.4615    0.0769    0.2308    0.2308    N
6    0.2308    0.3846    0.0000    0.3846    N
7    0.0000    0.0769    0.1539    0.7692    T
8    0.0000    0.6154    0.0769    0.3077    C
XX

I would like to calculate the Shannon Entropy for each motif, could anybody advise me on how to do this? I am most comfortable with Python, so if you have a few lines of code available or can point me to some package that would let me do this, or even provide me with a formula that would be much appreciated.

entropy dna motif position-weight-matrix sequence • 9.7k views

ADD COMMENT • link 10.0 years ago by a1ultima ▴ 840

Ram · Answer 1 · 2014-05-06

7

Entering edit mode

10.0 years ago

a1ultima ▴ 840

Thanks to advice and code from here and the theory and formula from here:

$H_i = - \sum f_{a,i} \times \log_2 f_{a,i}$ ,

where

$f_{a,i}$ is the relative frequency of base or amino acid $a$ at position $i$

I've adapted a Python script to answer my own question, hope this helps others in future:

import numpy as np
def compute_entropy(motif):
    arr = np.array(motif)
    H = (arr.flatten() * np.log2(arr.flatten())).sum(axis=0) # Formula from: http://en.wikipedia.org/wiki/Sequence_logo
    print 'entropy:', -H.mean(), 'bits'

motif_path  = '../data/stamp_data/out/dreme_100bp_e0.05/SWU_SSD/e005FBP.txt'
fi          = open(motif_path,'r')
motifs      = []
motif       = []
while True:
    line = fi.readline().strip().lower()
    if line == "":
       break
    elif line.startswith('de'):
        header = line
        print header
    elif line == 'xx": # when it's ended process the cached motif
        if motif:
            compute_entropy(motif)
        motifs.append(motif)
        motif = []
    else: # append motif rows if the line is not a header, nor a delimiter, nor a break
        motif.append(map(float,line.split()[1:-1]))
fi.close()

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 10.0 years ago by a1ultima ▴ 840

0

Entering edit mode

The mean of the entropy at each base is absolutely not the entropy of a motif. You're saying for example, a single BP with p=1/4 for atcg, has the same randomness as a ten BP sequence of such p=1/4's, when clearly the longer sequence contains more mathematical information. You asked a programming forum and copied the code from someone who added a disclaimer that he doesn't know what a biologists' entropy is.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by karl.stamm 4.1k

0

Entering edit mode

@karl.stamm, Firstly I did not copy the code it's quite different, secondly I did check the StackOverflow post's formula against a source here and it completely agrees with the source, quoting the source: " $H_i = - \sum f_{a,i} \times \log_2 f_{a,i}$ , where $f_{a,i}$ is the relative frequency of base or amino acid $a$ at position $i$ " ...The person who posted from StackOverflow made one assumption, which is coincidentally the pivotal assumption that profile hidden-markov model-based DNA profile alignment models make.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by a1ultima ▴ 840

0

Entering edit mode

Everything has an i subscript. This formula is correct as a per-base entropy, but that's irrelevant for the motif. Consider a motif of fifty As and one 25%-anything base. There's clearly only four possibilities, all equally likely, but the mean of 51 entropies will be nearly zero. Being a uniform distribution, this should have the highest entropy, and H=2 bits at the 25%-anything base, but do you mean for it to be averaged with the constant bases? Under my model of using all possible sequences, the constant bases don't impact the motif entropy, but under your model, fifty As will drown out the signal. I guess it depends on your requirements, but I need to insist that a mean of H sub i is not the entropy of a motif.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by karl.stamm 4.1k

0

Entering edit mode

You sound quite convincing, which makes me feel you're onto something, but I can't quite understand what you're saying... which probably means I've some more reading to do, so cheers.

ADD REPLY • link 10.0 years ago by a1ultima ▴ 840

score 1 · Answer 2 · 2014-05-05

1

Entering edit mode

10.0 years ago

pcantalupo ▴ 120

Hello,

I just found this interesting website that calculates Shanon entropy http://www.shannonentropy.netmark.pl/ and gives you the formula.

Hope it helps,

Paul

ADD COMMENT • link 10.0 years ago by pcantalupo ▴ 120

0

Entering edit mode

Hi @Paul, i'm trying to constrain my strings to just 4 possible letters (A,G,T,C) the decimals in the matrix dictate the relative frequencies of each of these letters e.g. for HMG-1 the first position has a 0.384 chance of being a C and 0.6154 chance of being a G. The online calculator unfortunately does not care about that information...

ADD REPLY • link 10.0 years ago by a1ultima ▴ 840

Ram · Answer 3 · 2014-05-05

The entropy formula is negative the sum of p log p, summed across all p; where p is the probability of each possible outcome (and traditionally log base 2). For a coin flip, a fair coin has two possibilities at p=0.5, so you get entropy=1.66

Sorry this code is in R, not python:

-(0.5*log(0.5,2) + 0.5*log(0.2,2))

If the coin was skewed so that tails comes up 99% of the time, it should be intuitive that there is less randomness in the sequence of flips, entropy=0.08

-(0.99*log(0.99,2) + 0.01*log(0.01,2))

So what you need is to enumerate all possible sequences, which I don't exactly see from your input data. That's a matrix of some kind? It looks like you have probabilities of ATCG at each position, so this is a set of sequences, for DE-SRF you have 4^11 possible sequences to sum across, and for DE-HMG1 you have 4^8 sequences to sum over. For each possible sequence you can calculate the p by multiplying the choices.

So for DE-HMG you have the possible sequence AAAAAAAA with probability 0*0*0*0*0*0.46*0.23*0*0 which comes to zero, so the first p term drops out. But DE-HMG has a potential sequence CGCCCCCC with probability 0.38*0.23*0.30*0.15*0.07*0.07*0.38*0.07*0.61. So you better write an algorithm that walks across these choices. Something with recursion might be able to skip all those zero-cases more efficiently, because enumerating the 4^11 choices for DE-SRF might take a while.

Then of course beware that PCs have trouble summing a ton of tiny tiny numbers, so you might run into numerical stability issues. (for example did you know that 1+10^-50 is exactly equal to one?)