Question: Obtain conservation (or % ident) in a multiple sequence alignment
0
gravatar for Biojl
2.6 years ago by
Biojl1.5k
Barcelona
Biojl1.5k wrote:

I need to calculate the conservation or percentage of identity across several sequences in a multiple alignment (not pairwise but across all sequences). I wrote a function that works with clustal format where column conservation is marked with * (and therefore is just a matter of counting stars). I was wondering if someone would have an equivalent function for FASTA alignments.

def calc_ident(alignment_location):
    iden = 0.0 #Method1: Calculated divided by the total length of the multiple alignment
    iden2 = 0.0 #Method2: Calculated only with the positions suitable for evolutionary analysis (no gap in any sequence)
    f = open(alignment_location, "r")
    completo = f.read()
    stars = completo.count('*')
    f.close()
    handle = open(alignment_location, "rU")
    records = list(SeqIO.parse(handle, "clustal"))
    long_total = len(records[0].seq)
    iden = (stars*100/long_total)
    long_useful = calc_long(records) #Calculating length for score2
    if long_useful == 0: #Calculating score2
        iden2=0
    else:
        iden2 = (stars*100/long_useful)
    handle.close()
    return iden, iden2, long_total, long_useful

def calc_long(records):
    length = 0
    j=0
    while j<len(records[0].seq):
        i=0
        while i< len(records):
            if records[i].seq[j] == '-':
                break
            elif i==(len(records)-1):
                length += 1
            i+=1
        j+=1
    return length 

alignment python • 1.2k views
ADD COMMENTlink modified 2.6 years ago by larsskjærven0 • written 2.6 years ago by Biojl1.5k

Why don't you just convert fasta to clustal? (http://sequenceconversion.bugaco.com/converter/biology/sequences/clustal_to_fasta.php)

ADD REPLYlink written 2.6 years ago by Biomonika (Noolean)3.0k

That's what I do, but it's cumbersome to add middle steps when you can avoid them.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Biojl1.5k
1
gravatar for Felix Francis
2.6 years ago by
Felix Francis420
United States/University of Delaware
Felix Francis420 wrote:
You could calculate Shannon's entropy for each column of the alignment. Lower entropy value indicates more conserved columns.

https://github.com/ffrancis/Multiple-sequence-alignment-Shannon-s-entropy

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Felix Francis420

Is there any publication describing (or comparing it against other measures) the Shannon entropy measure as a value to characterize how good a multiple alignment is, with real data?

ADD REPLYlink written 2.6 years ago by Biojl1.5k
1

Re: “calculate the conservation or percentage of identity across several sequences in a multiple alignment”:

Yes, Shannon’s entropy would be a good option. It may also be used to approximately evaluate alignment quality when you don’t have any reference (not the best approach though). http://www.biomedcentral.com/1471-2148/10/210/

For evaluating alignment accuracy:

NORMD (Thompson et al., 2001), MUMSA (Lassmann and Sonnhammer, 2005)], SP score, TC score, Q-score (http://www.drive5.com/qscore/), TCS (http://www.tcoffee.org/Projects/tcs/), and other consistency and probabilistic scoring approaches may be used.

ADD REPLYlink written 2.5 years ago by Felix Francis420
0
gravatar for larsskjærven
2.6 years ago by
Norway
larsskjærven0 wrote:

In R you can use Bio3d:

library(bio3d)

aln <- read.fasta("aln.fasta")

ent <- entropy(aln)

# standard entropy score for a 22-letter alphabet

ent$H

# plot results

plot(ent$H, type="h")

 

see the documentation for the entropy() function: http://thegrantlab.org/bio3d/html/entropy.html

 
ADD COMMENTlink written 2.6 years ago by larsskjærven0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1376 users visited in the last hour