Question

Obtain conservation (or % ident) in a multiple sequence alignment

0

Entering edit mode

9.1 years ago

Biojl ★ 1.7k

I need to calculate the conservation or percentage of identity across several sequences in a multiple alignment (not pairwise but across all sequences). I wrote a function that works with clustal format where column conservation is marked with * (and therefore is just a matter of counting stars). I was wondering if someone would have an equivalent function for FASTA alignments.

def calc_ident(alignment_location):
    iden = 0.0 #Method1: Calculated divided by the total length of the multiple alignment
    iden2 = 0.0 #Method2: Calculated only with the positions suitable for evolutionary analysis (no gap in any sequence)
    f = open(alignment_location, "r")
    completo = f.read()
    stars = completo.count('*')
    f.close()
    handle = open(alignment_location, "rU")
    records = list(SeqIO.parse(handle, "clustal"))
    long_total = len(records[0].seq)
    iden = (stars*100/long_total)
    long_useful = calc_long(records) #Calculating length for score2
    if long_useful == 0: #Calculating score2
        iden2=0
    else:
        iden2 = (stars*100/long_useful)
    handle.close()
    return iden, iden2, long_total, long_useful

def calc_long(records):
    length = 0
    j=0
    while j<len(records[0].seq):
        i=0
        while i< len(records):
            if records[i].seq[j] == '-':
                break
            elif i==(len(records)-1):
                length += 1
            i+=1
        j+=1
    return length

alignment Python • 5.5k views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.1 years ago by Biojl ★ 1.7k

0

Entering edit mode

Why don't you just convert fasta to clustal? (http://sequenceconversion.bugaco.com/converter/biology/sequences/clustal_to_fasta.php)

ADD REPLY • link 9.1 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

That's what I do, but it's cumbersome to add middle steps when you can avoid them.

ADD REPLY • link 9.1 years ago by Biojl ★ 1.7k

Ram · Answer 1 · 2015-03-06

1

Entering edit mode

9.1 years ago

Felix Francis ▴ 600

You could calculate Shannon's entropy for each column of the alignment. Lower entropy value indicates more conserved columns.

https://github.com/ffrancis/Multiple-sequence-alignment-Shannon-s-entropy

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.1 years ago by Felix Francis ▴ 600

0

Entering edit mode

Is there any publication describing (or comparing it against other measures) the Shannon entropy measure as a value to characterize how good a multiple alignment is, with real data?

ADD REPLY • link 9.1 years ago by Biojl ★ 1.7k

1

Entering edit mode

Re: "calculate the conservation or percentage of identity across several sequences in a multiple alignment":

Yes, Shannon's entropy would be a good option. It may also be used to approximately evaluate alignment quality when you don't have any reference (not the best approach though). http://www.biomedcentral.com/1471-2148/10/210/

For evaluating alignment accuracy:

NORMD (Thompson et al., 2001), MUMSA (Lassmann and Sonnhammer, 2005)], SP score, TC score, Q-score, TCS, and other consistency and probabilistic scoring approaches may be used.

ADD REPLY • link updated 23 months ago by Ram 43k • written 9.1 years ago by Felix Francis ▴ 600

Ram · Answer 2 · 2015-03-07

0

Entering edit mode

9.1 years ago

larsskjærven • 0

In R you can use Bio3d:

library(bio3d)

aln <- read.fasta("aln.fasta")
ent <- entropy(aln)

# standard entropy score for a 22-letter alphabet
ent$H

# plot results
plot(ent$H, type="h")

see the documentation for the entropy() function: http://thegrantlab.org/bio3d/html/entropy.html

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.1 years ago by larsskjærven • 0