Question: Shannon entropy score
3
18 months ago by
France
curiousbiologist40 wrote:

Hi all,

I'm looking to determine a Shannon entropy score for a short sequence corresponding to a hypervariable region. The idea is to compare this region across different samples. Does anyone have experience with that?

sequencing sequence • 970 views
modified 3 months ago by sacha1.2k • written 18 months ago by curiousbiologist40
1

Entropy Of Dna Sequences

Calculating Shannon Entropy for DNA sequence?: http://math.stackexchange.com/questions/1405130/calculating-shannon-entropy-for-dna-sequence

modified 18 months ago • written 18 months ago by Tonor420
2
18 months ago by
Joseph Hughes2.6k
Scotland, UK
Joseph Hughes2.6k wrote:

There is an R package called entropy.

1

Another R package could be infotheo.

written 18 months ago by ddiez1.7k
1
18 months ago by
Gabriel R.2.4k
Center for Geogenetik Københavns Universitet
Gabriel R.2.4k wrote:

Here is a C++ implementation:

https://github.com/grenaud/aLib/blob/0785cd32c32bd8b515b3a79daff4897833b0b63c/pipeline/filterReads.cpp

It hasn't been used/tested extensively but feel free to use the code.

1
18 months ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

BBDuk calculates Shannon entropy, and can pass or fail sequences based on the score. For example:

bbduk.sh in=sequences.fa out=pass.fa outm=fail.fa entropy=0.9 entropywindow=50 entropyk=5

The code is in BBDukF.java in the function averageEntropy().
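
For readers who want to see the general idea behind a windowed, k-mer based entropy score, here is a minimal Python sketch. It is not BBDuk's code: the function names `kmer_entropy` and `average_entropy` are made up for illustration, and the normalization to a 0..1 range (dividing by the largest entropy reachable with the k-mers in one window) is an assumption, so the scores need not match what BBDuk reports.

```python
import math
from collections import Counter


def kmer_entropy(window: str, k: int) -> float:
    """Shannon entropy of the k-mer distribution in `window`, scaled to 0..1."""
    n = len(window) - k + 1  # number of overlapping k-mers in the window
    if n <= 1:
        return 0.0
    counts = Counter(window[i:i + k] for i in range(n))
    h = -sum(c / n * math.log2(c / n) for c in counts.values())
    # Assumed normalization: divide by the maximum entropy that n k-mers
    # drawn from a 4-letter alphabet can reach.
    return h / math.log2(min(n, 4 ** k))


def average_entropy(seq: str, window: int = 50, k: int = 5) -> float:
    """Mean normalized k-mer entropy over all sliding windows of `seq`."""
    seq = seq.upper()
    if len(seq) <= window:
        return kmer_entropy(seq, k)
    scores = [kmer_entropy(seq[i:i + window], k)
              for i in range(len(seq) - window + 1)]
    return sum(scores) / len(scores)
```

A homopolymer such as `"A" * 100` scores 0 with this sketch, while a window in which every k-mer is distinct scores 1, so a threshold like `entropy=0.9` would separate low-complexity sequences from the rest in the same spirit as the command above.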

1
3 months ago by
haro220
France
haro220 wrote:

Give BioJava a try:

import java.util.*;

import org.biojava.bio.dist.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Entropy {
  public static void main(String[] args) {

    Distribution dist = null;
    try {
      //create a biased distribution
      dist =
          DistributionFactory.DEFAULT.createDistribution(DNATools.getDNA());

      //set the weight of a to 0.97
      dist.setWeight(DNATools.a(), 0.97);

      //set the others to 0.01
      dist.setWeight(DNATools.c(), 0.01);
      dist.setWeight(DNATools.g(), 0.01);
      dist.setWeight(DNATools.t(), 0.01);
    }
    catch (Exception ex) {
      ex.printStackTrace();
      System.exit(-1);
    }

    //calculate the information content
    double info = DistributionTools.bitsOfInformation(dist);
    System.out.println("information = "+info+" bits");
    System.out.print("\n");

    //calculate the Entropy (using the conventional log base of 2)
    HashMap entropy = DistributionTools.shannonEntropy(dist, 2.0);

    //print the Entropy of each residue
    System.out.println("Symbol\tEntropy");
    for (Iterator i = entropy.keySet().iterator(); i.hasNext(); ) {
      Symbol sym = (Symbol)i.next();
      System.out.println(sym.getName()+ "\t" +entropy.get(sym));
    }
  }
}
1
3 months ago by
sacha1.2k
France
sacha1.2k wrote:

The 'seqtk comp' command returns the #A, #C, #G, #T composition of each sequence.
With the following FASTA file:

>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA

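You can run this one-liner to get the Shannon entropy score for each sequence in your FASTA file (H is reset for each record so that the sum does not carry over from one sequence to the next):

seqtk comp test.fa|awk '{H=0;for(i=3;i<=6;i++){if($i){H+=$i/$2*log($i/$2)/log(2)}}print $1,-H}'

which returns: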

seq1 0
seq2 1.84199
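
If seqtk is not at hand, the same per-sequence calculation can be sketched in plain Python. This is only an illustration: `read_fasta` and `shannon_entropy` are hypothetical helper names, and unlike the awk one-liner this version counts every symbol it sees (including N), so results can differ on sequences with ambiguous bases.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol of the symbol composition of `seq`."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    return sum(-c / n * math.log2(c / n) for c in counts.values())


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# The two-record example from above, inlined so the sketch is self-contained.
fasta = """>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA
"""

for name, seq in read_fasta(fasta.splitlines()):
    print(name, round(shannon_entropy(seq), 5))
```

On the two example records this gives 0 for seq1 and about 1.842 for seq2, in line with the seqtk/awk result.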
0
18 months ago by
France
curiousbiologist40 wrote:

I found this on the net. The next step would be to adapt it for NGS use.

http://code.activestate.com/recipes/577476-shannon-entropy-calculation/

# Shannon Entropy of a string
# = minimum average number of bits per symbol
# required for encoding the string
#
# So the theoretical limit for data compression:
# Shannon Entropy of the string * string length
# FB - 201011291
import math

st = 'acgtaggatcccctat' # input string
# st = '00010101011110' # Shannon entropy for this string would be 1 bit/symbol

print('Input string:')
print(st)
print()
stList = list(st)
alphabet = sorted(set(stList)) # sorted list of distinct symbols in the string
print('Alphabet of symbols in the string:')
print(alphabet)
print()
# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
    ctr = 0
    for sym in stList:
        if sym == symbol:
            ctr += 1
    freqList.append(ctr / len(stList))
print('Frequencies of alphabet symbols:')
print(freqList)
print()
# Shannon entropy
ent = 0.0
for freq in freqList:
    ent = ent + freq * math.log2(freq)
ent = -ent
print('Shannon entropy:')
print(ent)
print('Minimum number of bits required to encode each symbol:')
print(int(math.ceil(ent)))

What do you mean by adapting it for NGS use? Do you want to calculate the entropy on a per-site basis along the genome, based on the bases in the reads that are aligned to each position?

written 18 months ago by Tonor420
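
As an illustration of the per-site reading of the question, here is a minimal Python sketch that computes one entropy value per alignment column. It assumes the reads covering the region have already been aligned and padded to equal length (with "-" for gaps), which is not something the thread specifies. The function name `column_entropies` and the choice to ignore gaps and N are illustrative assumptions.

```python
import math
from collections import Counter


def column_entropies(aligned_reads, ignore="-N"):
    """Return the Shannon entropy (bits) of each column of an alignment.

    `aligned_reads` is a list of equal-length strings; characters listed in
    `ignore` (gaps and N by default) are skipped when counting bases.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for column in zip(*aligned_reads):
        counts = Counter(b for b in (c.upper() for c in column) if b not in ignore)
        depth = sum(counts.values())
        # An empty column (only gaps or N) contributes an entropy of 0.
        h = sum(-c / depth * math.log2(c / depth) for c in counts.values())
        entropies.append(h)
    return entropies


reads = ["ACGT", "ACGA", "ACCA", "ATCA"]
for position, h in enumerate(column_entropies(reads), start=1):
    print(position, round(h, 4))
```

Comparing samples could then come down to comparing these per-column profiles over the hypervariable region, or their mean.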

The objective would be to do it on a defined region that may or may not be aligned to a reference.

written 18 months ago by curiousbiologist40
0
18 months ago by
Belgium
ahmedakhokhar90 wrote:

Please see the publication http://bioinformatics.oxfordjournals.org/content/23/15/1875.full.pdf


Sorry, but how's this relevant to OP's question?

written 18 months ago by Ram15k