shannon entropy score
7
3
Entering edit mode
6.3 years ago

Hi all,

I'm looking to determinate shannon entropy score for a short sequence corresponding for an hyper-variable region, the idea is to compare this region for different samples. Any experience with that?

sequence sequencing • 4.3k views
1
Entering edit mode
2
Entering edit mode
6.3 years ago
Joseph Hughes ★ 3.0k

There is an R package called entropy.

1
Entering edit mode

Another R package could be infotheo.

2
Entering edit mode
5.1 years ago
sacha ★ 2.4k

'seqtk comp' command return #A,#C,#G,#T composition.
With the following fasta file :

>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA


You can run this oneliner to get Shannon entropy score for each sequence in your fasta.

seqtk comp test.fa|awk '{for(i=3;i<=6;i++){if($i){H+=$i/$2*log($i/$2)/log(2)}}print$1,-H}'


which return :

seq1 0
seq2 1.84199

1
Entering edit mode
6.3 years ago
Gabriel R. ★ 2.9k

Here is a C++ implementation:

It hasn't been used/tested extensively but feel free to use the code.

1
Entering edit mode
6.3 years ago

BBDuk calculates Shannon entropy, and can pass or fail sequences based on the score. For example:

bbduk.sh in=sequences.fa out=pass.fa outm=fail.fa entropy=0.9 entropywindow=50 entropyk=5


The code is in BBDukF.java in the function averageEntropy().

1
Entering edit mode
5.1 years ago
vmicrobio ▴ 290

Give a try to biojava:

import java.util.*;

import org.biojava.bio.dist.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Entropy {
public static void main(String[] args) {

Distribution dist = null;
try {
//create a biased distribution
dist =
DistributionFactory.DEFAULT.createDistribution(DNATools.getDNA());

//set the weight of a to 0.97
dist.setWeight(DNATools.a(), 0.97);

//set the others to 0.01
dist.setWeight(DNATools.c(), 0.01);
dist.setWeight(DNATools.g(), 0.01);
dist.setWeight(DNATools.t(), 0.01);
}
catch (Exception ex) {
ex.printStackTrace();
System.exit(-1);
}

//calculate the information content
double info = DistributionTools.bitsOfInformation(dist);
System.out.println("information = "+info+" bits");
System.out.print("\n");

//calculate the Entropy (using the conventional log base of 2)
HashMap entropy = DistributionTools.shannonEntropy(dist, 2.0);

//print the Entropy of each residue
System.out.println("Symbol\tEntropy");
for (Iterator i = entropy.keySet().iterator(); i.hasNext(); ) {
Symbol sym = (Symbol)i.next();
System.out.println(sym.getName()+ "\t" +entropy.get(sym));
}
}
}

0
Entering edit mode
6.3 years ago

If found this on the net. Next step would be to implement it for a NGS use

http://code.activestate.com/recipes/577476-shannon-entropy-calculation/

# Shannon Entropy of a string
# = minimum average number of bits per symbol
# required for encoding the string
#
# So the theoretical limit for data compression:
# Shannon Entropy of the string * string length
# FB - 201011291
import math
from sets import Set

st = 'acgtaggatcccctat' # input string
# st = '00010101011110' # Shannon entropy for 'aabcddddefffg' would be 1 bit/symbol

print 'Input string:'
print st
print
stList = list(st)
alphabet = list(Set(stList)) # list of symbols in the string
print 'Alphabet of symbols in the string:'
print alphabet
print
# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
ctr = 0
for sym in stList:
if sym == symbol:
ctr += 1
freqList.append(float(ctr) / len(stList))
print 'Frequencies of alphabet symbols:'
print freqList
print
# Shannon entropy
ent = 0.0
for freq in freqList:
ent = ent + freq * math.log(freq, 2)
ent = -ent
print 'Shannon entropy:'
print ent
print 'Minimum number of bits required to encode each symbol:'
print int(math.ceil(ent))

0
Entering edit mode

What do you mean adapt for NGS use? Are you wanting to calculate the entropy on a per site basis on the genome, based on the bases in reads that are aligned to that position?

0
Entering edit mode

the objective would be to do it on a defined region that could be (or not) aligned to a reference

0
Entering edit mode
0
Entering edit mode

Sorry, but how's this relevant to OP's question?