Question: Shannon entropy score
3
24 months ago by
France
curiousbiologist40 wrote:

Hi all,

I'm looking to determine the Shannon entropy score for a short sequence corresponding to a hypervariable region; the idea is to compare this region across different samples. Does anyone have experience with that?

sequencing sequence • 1.4k views
modified 9 months ago by sacha1.6k • written 24 months ago by curiousbiologist40
1

Entropy of DNA Sequences

Calculating Shannon Entropy for DNA sequence?: http://math.stackexchange.com/questions/1405130/calculating-shannon-entropy-for-dna-sequence
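The per-base entropy discussed in that thread can be computed directly in Python; a minimal sketch:

```python
# Minimal sketch: Shannon entropy (bits/symbol) of the base
# composition of a DNA sequence.
import math
from collections import Counter

def shannon_entropy(seq):
    counts = Counter(seq.upper())
    n = len(seq)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

print(shannon_entropy("AAAA"))      # 0.0 (single base, no uncertainty)
print(shannon_entropy("ACGTACGT"))  # 2.0 (uniform composition, 2 bits/symbol)
```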

2
24 months ago by
Joseph Hughes2.6k
Scotland, UK
Joseph Hughes2.6k wrote:

There is an R package called entropy.

1

Another option is the infotheo R package.

1
24 months ago by
Gabriel R.2.5k
Center for Geogenetik Københavns Universitet
Gabriel R.2.5k wrote:

Here is a C++ implementation:

It hasn't been used/tested extensively but feel free to use the code.

1
24 months ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

BBDuk calculates Shannon entropy, and can pass or fail sequences based on the score. For example:

```
bbduk.sh in=sequences.fa out=pass.fa outm=fail.fa entropy=0.9 entropywindow=50 entropyk=5
```

The code is in BBDukF.java in the function averageEntropy().
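Roughly, the filter scores the k-mer entropy of sliding windows. A Python sketch of the idea (not BBDuk's exact formula; in particular, normalizing by the maximum achievable window entropy is an assumption here):

```python
# Sketch of windowed k-mer entropy, loosely modelled on BBDuk's
# entropy filter (entropyk=5, entropywindow=50). The normalization
# (dividing by the best-case entropy of the window) is an assumption,
# not necessarily BBDuk's exact scoring.
import math
from collections import Counter

def window_entropy(seq, k=5):
    """Shannon entropy of the k-mer distribution in seq, scaled to 0..1."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    n = len(kmers)
    h = sum(-c / n * math.log2(c / n) for c in counts.values())
    hmax = math.log2(min(n, 4 ** k))  # best case: all k-mers distinct
    return h / hmax if hmax > 0 else 0.0

def passes_entropy(seq, cutoff=0.9, window=50, k=5):
    """True if every window of the sequence meets the entropy cutoff."""
    if len(seq) < window:
        return window_entropy(seq, k) >= cutoff
    return all(window_entropy(seq[i:i + window], k) >= cutoff
               for i in range(len(seq) - window + 1))

print(passes_entropy("ACGT" * 25))  # low-complexity repeat fails the cutoff
print(passes_entropy("A" * 100))    # homopolymer fails the cutoff
```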

1
9 months ago by
haro230
France
haro230 wrote:

Give biojava a try:

```
import java.util.*;

import org.biojava.bio.dist.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Entropy {
    public static void main(String[] args) {

        Distribution dist = null;
        try {
            // create a biased distribution
            dist = DistributionFactory.DEFAULT.createDistribution(DNATools.getDNA());

            // set the weight of a to 0.97
            dist.setWeight(DNATools.a(), 0.97);

            // set the others to 0.01
            dist.setWeight(DNATools.c(), 0.01);
            dist.setWeight(DNATools.g(), 0.01);
            dist.setWeight(DNATools.t(), 0.01);
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(-1);
        }

        // calculate the information content
        double info = DistributionTools.bitsOfInformation(dist);
        System.out.println("information = " + info + " bits");
        System.out.print("\n");

        // calculate the entropy (using the conventional log base of 2)
        HashMap entropy = DistributionTools.shannonEntropy(dist, 2.0);

        // print the entropy of each residue
        System.out.println("Symbol\tEntropy");
        for (Iterator i = entropy.keySet().iterator(); i.hasNext(); ) {
            Symbol sym = (Symbol) i.next();
            System.out.println(sym.getName() + "\t" + entropy.get(sym));
        }
    }
}
```
1
9 months ago by
sacha1.6k
France
sacha1.6k wrote:

The 'seqtk comp' command returns the #A, #C, #G, #T composition.
With the following fasta file:

```
>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA
```

you can run this one-liner to get the Shannon entropy score for each sequence in your fasta:

```
seqtk comp test.fa | awk '{H=0; for(i=3;i<=6;i++){if($i){H+=$i/$2*log($i/$2)/log(2)}} print $1,-H}'
```

which returns:

```
seq1 0
seq2 1.84199
```
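The same per-sequence score can be reproduced without seqtk in plain Python (a sketch; the example sequences from the fasta above are inlined rather than parsed from the file):

```python
# Sketch: per-sequence Shannon entropy, equivalent to the
# seqtk/awk one-liner above. The example sequences come from
# the test.fa shown earlier.
import math
from collections import Counter

def entropy(seq):
    """Shannon entropy (bits/symbol) of the base composition of seq."""
    counts = Counter(seq.upper())
    n = len(seq)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

for name, seq in [("seq1", "AAAA"), ("seq2", "ATCGACTTTTTTGTAGTACGTA")]:
    print(name, round(entropy(seq), 5))
# seq1 0.0
# seq2 1.84199
```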
0
24 months ago by
France
curiousbiologist40 wrote:

I found this on the net. The next step would be to implement it for NGS use:

http://code.activestate.com/recipes/577476-shannon-entropy-calculation/

```
# Shannon entropy of a string
# = minimum average number of bits per symbol
# required for encoding the string
#
# So the theoretical limit for data compression:
# Shannon entropy of the string * string length
# FB - 201011291
import math

st = 'acgtaggatcccctat'  # input string
# st = '00010101011110'  # a binary alphabet gives at most 1 bit/symbol

print('Input string:')
print(st)
print()
stList = list(st)
alphabet = list(set(stList))  # list of symbols in the string
print('Alphabet of symbols in the string:')
print(alphabet)
print()
# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
    ctr = 0
    for sym in stList:
        if sym == symbol:
            ctr += 1
    freqList.append(float(ctr) / len(stList))
print('Frequencies of alphabet symbols:')
print(freqList)
print()
# Shannon entropy
ent = 0.0
for freq in freqList:
    ent = ent + freq * math.log(freq, 2)
ent = -ent
print('Shannon entropy:')
print(ent)
print('Minimum number of bits required to encode each symbol:')
print(int(math.ceil(ent)))
```

What do you mean by adapting it for NGS use? Do you want to calculate the entropy on a per-site basis on the genome, based on the bases of the reads aligned to that position?

The objective would be to do it on a defined region that could be (or not) aligned to a reference.
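For that kind of per-region comparison, one common approach is column-wise entropy over an alignment of the region from each sample. A sketch (the toy alignment below is made up for illustration):

```python
# Sketch: per-column Shannon entropy across aligned copies of a
# region (e.g. the hypervariable region from several samples).
# The sample sequences are invented for illustration.
import math
from collections import Counter

def column_entropies(aligned_seqs):
    """Shannon entropy (bits) at each column of equal-length sequences."""
    n = len(aligned_seqs)
    entropies = []
    for col in zip(*aligned_seqs):
        counts = Counter(col)
        h = sum(-c / n * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies

samples = [
    "ACGTA",
    "ACGTT",
    "ACCTA",
    "ACGTA",
]
# variable columns (3 and 5) score above zero, conserved columns score 0
for pos, h in enumerate(column_entropies(samples), start=1):
    print(pos, round(h, 3))
```

Summing or averaging the column entropies gives a single score per sample set for the region.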

0
24 months ago by
Belgium
ahmedakhokhar90 wrote:

Sorry, but how's this relevant to OP's question?
