Shannon entropy of a DNA motif?
3
3
Entering edit mode
10.0 years ago
a1ultima ▴ 840

I have DNA motifs represented by position weight matrices (PWMs), a.k.a. position-specific scoring matrices (PSSMs), in TRANSFAC format. Motif names appear in rows starting with "DE"; the numbered rows give the observed frequencies of each letter at that position along the sequence (row 0 is the first position); and the last column shows the most representative letter for that position, which may be an ambiguity code (not A, G, C or T) if no single letter is representative:

DE    SRF
0    0.0435    0.0217    0.8478    0.0870    G
1    0.1957    0.7174    0.0435    0.0435    C
2    0.0000    0.9782    0.0217    0.0000    C
3    0.0217    0.9782    0.0000    0.0000    C
4    0.6956    0.0217    0.0000    0.2826    A
5    0.0652    0.0217    0.0000    0.9130    T
6    1.0000    0.0000    0.0000    0.0000    A
7    0.0217    0.0000    0.0000    0.9782    T
8    0.9348    0.0000    0.0000    0.0652    A
9    0.3261    0.0217    0.0000    0.6522    T
10    0.0435    0.0000    0.9565    0.0000    G
11    0.0435    0.0217    0.9348    0.0000    G
XX
DE    HMG-1
0    0.0000    0.3846    0.6154    0.0000    G
1    0.0000    0.0000    0.2308    0.7692    T
2    0.0000    0.3077    0.0000    0.6923    T
3    0.0000    0.1539    0.7692    0.0769    G
4    0.0000    0.0769    0.0000    0.9230    T
5    0.4615    0.0769    0.2308    0.2308    N
6    0.2308    0.3846    0.0000    0.3846    N
7    0.0000    0.0769    0.1539    0.7692    T
8    0.0000    0.6154    0.0769    0.3077    C
XX

I would like to calculate the Shannon Entropy for each motif, could anybody advise me on how to do this? I am most comfortable with Python, so if you have a few lines of code available or can point me to some package that would let me do this, or even provide me with a formula that would be much appreciated.

entropy dna motif position-weight-matrix sequence • 9.6k views
ADD COMMENT
7
Entering edit mode
10.0 years ago
a1ultima ▴ 840

Thanks to advice and code from here and the theory and formula from here:

H_i = - \sum_a f_{a,i} \times \log_2 f_{a,i} ,

where f_{a,i} is the relative frequency of base or amino acid a at position i.

I've adapted a Python script to answer my own question, hope this helps others in future:

import numpy as np

def compute_entropy(motif):
    """Print the mean per-position Shannon entropy of a motif.
    `motif` is a list of rows, each row holding the four A/C/G/T frequencies
    of one position."""
    arr = np.array(motif)
    # H_i = -sum_a f_ai * log2(f_ai), using the 0 * log2(0) = 0 convention
    # Formula from: http://en.wikipedia.org/wiki/Sequence_logo
    logs = np.zeros_like(arr)
    np.log2(arr, out=logs, where=arr > 0)   # leave the log of zero entries at 0
    H = -(arr * logs).sum(axis=1)           # one entropy value per position
    print('entropy:', H.mean(), 'bits')     # mean over positions

motif_path = '../data/stamp_data/out/dreme_100bp_e0.05/SWU_SSD/e005FBP.txt'
motifs = []
motif = []
with open(motif_path) as fi:
    for line in fi:
        line = line.strip().lower()
        if line == "":
            break
        elif line.startswith('de'):          # header row: motif name
            header = line
            print(header)
        elif line == 'xx':                   # delimiter: process the cached motif
            if motif:
                compute_entropy(motif)
            motifs.append(motif)
            motif = []
        else:                                # numbered row: keep the four frequencies
            motif.append([float(x) for x in line.split()[1:-1]])
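
As a cross-check (not part of the script above), SciPy's scipy.stats.entropy computes the same per-position values and handles the 0 * log2(0) = 0 convention for you; a minimal sketch, assuming motif holds one of the frequency matrices parsed by the loop above:

import numpy as np
from scipy.stats import entropy

arr = np.array(motif)                                  # rows = positions, columns = A/C/G/T
per_position = [entropy(row, base=2) for row in arr]   # H_i for each position, in bits
print('mean per-position entropy:', np.mean(per_position), 'bits')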
ADD COMMENT
0
Entering edit mode

The mean of the entropy at each base is absolutely not the entropy of a motif. You're saying, for example, that a single bp with p = 1/4 for each of A/T/C/G has the same randomness as a ten-bp sequence of such p = 1/4 positions, when clearly the longer sequence contains more mathematical information. You asked a programming forum and copied the code from someone who added a disclaimer that he doesn't know what a biologist's entropy is.

ADD REPLY
0
Entering edit mode

@karl.stamm, firstly, I did not copy the code; it's quite different. Secondly, I did check the StackOverflow post's formula against a source here, and it agrees with the source completely, quoting: "H_i = - \sum_a f_{a,i} \times \log_2 f_{a,i}, where f_{a,i} is the relative frequency of base or amino acid a at position i". The person who posted on StackOverflow made one assumption, which is coincidentally the pivotal assumption that profile hidden Markov model-based DNA profile alignment models make.

ADD REPLY
0
Entering edit mode

Everything has an i subscript. This formula is correct as a per-base entropy, but that's irrelevant for the motif. Consider a motif of fifty As and one 25%-anything base. There are clearly only four possible sequences, all equally likely, but the mean of 51 per-position entropies will be nearly zero. Being a uniform distribution, this should have the highest entropy, and H = 2 bits at the 25%-anything base, but do you mean for it to be averaged with the constant bases? Under my model of using all possible sequences, the constant bases don't affect the motif entropy, but under your model fifty As will drown out the signal. I guess it depends on your requirements, but I must insist that the mean of the H_i is not the entropy of a motif.
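
A small numeric illustration of this point, using a toy pwm list (three constant positions plus one 25%-anything position; the fifty-A case behaves the same way, just with a smaller mean):

from itertools import product
import numpy as np

# Toy motif: three fixed 'A' positions plus one uniform position; columns are A, C, G, T
pwm = [
    [1.00, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]

# Per-position entropies, with the 0 * log2(0) = 0 convention
H_i = [-sum(f * np.log2(f) for f in row if f > 0) for row in pwm]
print('mean of H_i:', np.mean(H_i))    # 0.5 bits
print('sum of H_i :', np.sum(H_i))     # 2.0 bits

# Entropy of the full sequence distribution, by enumerating all 4^4 sequences
probs = [np.prod(picks) for picks in product(*pwm)]
H_joint = -sum(p * np.log2(p) for p in probs if p > 0)
print('joint H    :', H_joint)         # 2.0 bits: matches the sum, not the mean

With fifty constant positions the mean drops to 2/51 ≈ 0.04 bits, while the entropy of the four equally likely sequences stays at 2 bits.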

ADD REPLY
0
Entering edit mode

You sound quite convincing, which makes me feel you're onto something, but I can't quite understand what you're saying... which probably means I've some more reading to do, so cheers.

ADD REPLY
1
Entering edit mode
10.0 years ago
pcantalupo ▴ 120

Hello,

I just found this interesting website that calculates Shannon entropy, http://www.shannonentropy.netmark.pl/, and gives you the formula.

Hope it helps,

Paul

ADD COMMENT
0
Entering edit mode

Hi @Paul, I'm trying to constrain my strings to just four possible letters (A, G, T, C); the decimals in the matrix give the relative frequencies of each of these letters, e.g. for HMG-1 the first position has a 0.3846 chance of being a C and a 0.6154 chance of being a G. The online calculator unfortunately does not take that information into account...

ADD REPLY
1
Entering edit mode
10.0 years ago
karl.stamm
The entropy formula is the negative of the sum of p log p across all possible outcomes, where p is the probability of each outcome (traditionally with log base 2). For a coin flip, a fair coin has two possibilities at p = 0.5, so you get entropy = 1 bit.

Sorry this code is in R, not python:

-(0.5*log(0.5,2) + 0.5*log(0.5,2))

If the coin were skewed so that tails comes up 99% of the time, it should be intuitive that there is less randomness in the sequence of flips; entropy ≈ 0.08 bits:

-(0.99*log(0.99,2) + 0.01*log(0.01,2))

So what you need is to enumerate all possible sequences, which I don't exactly see from your input data. That's a matrix of some kind? It looks like you have probabilities of A, T, C and G at each position, so each motif describes a set of sequences: for DE-SRF you have 4^12 possible sequences to sum across, and for DE-HMG-1 you have 4^9 sequences to sum over. For each possible sequence you can calculate its p by multiplying the per-position choices.

So for DE-HMG-1 you have the possible sequence AAAAAAAAA with probability 0*0*0*0*0*0.46*0.23*0*0, which comes to zero, so that p term drops out. But DE-HMG-1 has a potential sequence CGCCCCCCC with probability 0.38*0.23*0.30*0.15*0.07*0.07*0.38*0.07*0.61. So you had better write an algorithm that walks across these choices. Something with recursion might be able to skip all those zero cases more efficiently, because enumerating the 4^12 choices for DE-SRF might take a while.
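
A rough Python sketch of that enumeration (the function name and the pwm argument, a list of per-position [A, C, G, T] frequencies, are placeholders; it follows the walk-and-multiply idea above and drops zero-probability terms):

from itertools import product
from math import log2, fsum

def motif_entropy_bruteforce(pwm):
    # pwm: list of rows, each row the [A, C, G, T] frequencies of one position.
    # Enumerates every possible sequence, multiplies the per-position
    # frequencies to get its probability, and skips the zero terms.
    terms = []
    for picks in product(*pwm):      # one frequency chosen per position
        p = 1.0
        for f in picks:
            p *= f
            if p == 0.0:             # a zero choice kills the whole sequence
                break
        if p > 0.0:
            terms.append(p * log2(p))
    return -fsum(terms)              # fsum limits rounding error in the sum

For the 9-position DE-HMG-1 matrix this loops over 4^9 = 262,144 combinations and finishes quickly; the 12-position DE-SRF matrix means 4^12 ≈ 16.8 million combinations, which still runs but is noticeably slower, hence the suggestion of recursion to prune zero branches early.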

Then of course beware that PCs have trouble summing a ton of tiny tiny numbers, so you might run into numerical stability issues. (for example did you know that 1+10^-50 is exactly equal to one?)
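
That caveat is easy to reproduce in Python, and math.fsum is one stock remedy for the summation part:

>>> 1 + 1e-50 == 1            # the tiny term is lost in double precision
True
>>> sum([0.1] * 10) == 1.0    # naive left-to-right summation accumulates error
False
>>> import math
>>> math.fsum([0.1] * 10) == 1.0    # fsum tracks the lost low-order bits
True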

ADD COMMENT
0
Entering edit mode

Hi @karl.stamm

Your last three paragraphs were hard to follow, but regarding the formula: how do I choose the base? Is it equal to the number of possible outcomes (e.g. A, G, T, C are four possible outcomes => base = 4)?

I'm trying to constrain my strings to just four possible letters (A, G, T, C); the decimals in the matrix give the relative frequencies of each of these letters, e.g. for HMG-1 the first position has a 0.3846 chance of being a C and a 0.6154 chance of being a G. The online calculator mentioned in the other answer unfortunately does not take that information into account...

ADD REPLY
0
Entering edit mode

Shannon entropy should use log base two to measure in units of 'bits'. Other log bases only scale the outcome, like changing from feet to meters or miles. The tricky part is to characterize the possible sequences contained by the motif as a vector of probabilities. Each of your examples represents a large number of potential DNA sequences of varying likelihood, and that's why you want to measure entropy. Consider DE-SRF: it is 12 letters long, each of which could be any of four nucleotides, so there are 4^12 probabilities between zero and one, summing to one. Shannon entropy is defined for that vector of nearly 16.8 million values. In my coin-flip example, this is a coin with 16.8 million sides, each with some weight that is calculable by walking your matrix.
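
One shortcut worth noting (standard information theory, not spelled out in the thread): if the PWM positions are treated as independent, which is the usual PWM assumption, the entropy of that huge vector of sequence probabilities equals the sum of the per-position entropies, so the full enumeration can be skipped. A minimal sketch, with pwm again a placeholder list of per-position [A, C, G, T] frequencies:

from math import log2

def motif_entropy_independent(pwm):
    # Sum of per-position entropies; for independent positions this equals
    # the entropy of the full 4^L distribution over sequences.
    return sum(-sum(f * log2(f) for f in row if f > 0) for row in pwm)

A 25%-anything position contributes the full 2 bits and a constant position contributes nothing, matching the observation above that constant bases don't affect the motif entropy.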

ADD REPLY
