Question: How to display mismatched sequences from alignment when using Biopython
0
gravatar for genebow
3.3 years ago by
genebow160
USA/Chicago
genebow160 wrote:

When using Biopython alignment to align two protein sequences, the output does not show the mismatched amino acids. Would you please have some suggestions to display the mismatched amino acids by Biopython?

The following is the code for alignment by Biopython. I tried different options, the problem could not be resolved.

----------------------------------------------------------

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

proteinSeqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
proteinSeqseqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

alignments = pairwise2.align.localxx(proteinSeqQuery,proteinSeqseqFound)
print('Score',alignments[0][2])
print(format_alignment(*alignments[0]))

----------------------------------------------------------------

The program output and the output from EMBI-EBI EMBOSS Water (Protein Alignment) are shown as follows:

Biopython:
HRTLGLLLHTQVSI-QQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HRTLGLLLHTQVSIKQ-LLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK


EMBL-EBI:
EMBOSS_001         1 HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYW     50
                     ||||||||||||||:||||||||||||||||||.||||||||||||||||
EMBOSS_001         1 HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYW     50

EMBOSS_001        51 KLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK     82
                     |||||||||||||||||||||||||||||:||
EMBOSS_001        51 KLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK     82

and in this picture: output

Thank you in advance!

biopython alignment • 4.1k views
ADD COMMENTlink modified 3.3 years ago by st.ph.n2.5k • written 3.3 years ago by genebow160
1

I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 3.3 years ago by WouterDeCoster44k

Thanks! I will use this approach for formatting code.

ADD REPLYlink written 3.3 years ago by genebow160

Please say which version of Biopython you are using, since the pairwise code was rewritten in Biopython 1.68

ADD REPLYlink written 3.3 years ago by Peter5.9k
2
gravatar for Joe
3.3 years ago by
Joe18k
United Kingdom
Joe18k wrote:

Here's a bit of a bodge that might work. Note though that it doesn't take in to consideration amino acid similarity. It would be easy enough to code in with a few more elif statements though provided you can put together a table of what matches with what.

In this instance it just leaves a blank where there's no match, by simply iterating both strings from pairwise2 and checking equivalency. You could create a slightly fancier print output too if desired but I think this works.

In essence all you need to do is create a custom string printed in the middle of the 2 alignments. It's generated by appending to a list, and then joining the list to make another string. Standard use of monospaced fonts in terminals etc means that the alignments of all the character columns is taken care of for you.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

proteinSeqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
proteinSeqseqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

alignments = pairwise2.align.localxx(proteinSeqQuery,proteinSeqseqFound)

match = []

for a, b in zip(alignments[0][0],alignments[0][1]):
        if a == b:
                match.append('|')
        else:
                match.append(' ')


print(alignments[0][0])
print("".join(match))
print(alignments[0][1])

Yields:

HRTLGLLLHTQVSIQ-QLLKLPAECFHPKPKVNSV-LIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKH-AK
||||||||||||||  ||||||||||||||||||  |||||||||||||||||||||||||||||||||||||||||||||  ||
HRTLGLLLHTQVSI-KQLLKLPAECFHPKPKVNS-ALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMK-YAK
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Joe18k

It works! Thank you very much!

ADD REPLYlink written 3.3 years ago by genebow160

Great, if this answers your question satisfactorily, be sure to accept the answer (tick on the left hand side of the post).

ADD REPLYlink written 3.3 years ago by Joe18k

Also I'd maybe play around with the gap penalty and mismatch penalty a little bit since that alignment seems to be scoring gaps quite low.

For instance, it looks to me like the 2 gaps here should instead just be a mismatch, but I could be wrong...

   QVSIQ-QLLKL
   ||||  |||||
   QVSI-KQLLKL

Would make more sense to me as:

   QVSIQQLLKL
   |||| |||||
   QVSIKQLLKL
ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Joe18k
1

Thank you! yes, that is a problem. What gap penalty do you use in localxx() for the correct result (second one)? It looks localms() function should be used. The following is the improved version.

from Bio import pairwise2

seqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
seqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

# Four parameters (2,-1,-1,-1) in alignment:Identical characters are given 2 points, 1 point is deducted for each non-identical character.
# 1 point is deducted when opening a gap, and 1 point is deducted when extending it.

alignments = pairwise2.align.localms(seqQuery,seqFound, 2, -1, -1, -1)

match = []

for a, b in zip(alignments[0][0],alignments[0][1]):
        if a == b:
                match.append('|')
        else:
                match.append('*')

m="".join(match)
s=[]
s.append(alignments[0][0]+'\n')
s.append(m+'\n')
s.append(alignments[0][1])

alignedSeqs="".join(s)
print('\n')
print(alignedSeqs)

The output is as follows:

HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK
||||||||||||||*||||||||||||||||||*|||||||||||||||||||||||||||||||||||||||||||||*||
HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK
ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genebow160
1

Yep that looks to be more reasonable. Opening a gap is usually the most heavily penalised action AFAIK, so that makes for a much more parsimonious/high scoring alignment.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Joe18k

I didn't, that was just a mockup I did by hand from the sequence. I think the default gap penalty for BLAST is -10 or -6. Maybe have a look around at what other common parameters use and try that?

ADD REPLYlink written 3.3 years ago by Joe18k

Thanks! There are a lot of confusions in the alignment functions in Biopython. I currently use Biopython v1.67, and will upgrade to v1.68.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genebow160
1

The current release is Biopython 1.70, but please do try upgrading since the pairwise code was rewritten in Biopython 1.68

ADD REPLYlink written 3.3 years ago by Peter5.9k
1

Have a look in Biopython's API. pairwise2 is complex and allows many options. To get the same results as in EMBOSS' WATER you should use the same parameters, e.g. using align.localds(seq1, seq2, matrix, gap-open-penalty, gap-extend-penalty). In the API is an example how to implement the matrix.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Markus310

Is there a way to make this faster? It seems really slow right now.

ADD REPLYlink written 2.3 years ago by biogirl0410

If you’re looking for speed you’re probably not looking for Biopython.

ADD REPLYlink written 2.3 years ago by Joe18k
0
gravatar for st.ph.n
3.3 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:
#!/usr/bin/env python

query='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
target='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

aln = ''

for n in range(len(query)):
    if query[n] == target[n]:
        aln += '.'
    else:
        aln += query[n]

print '{}\t{}'.format('Query', query)
print '{}\t{}'.format('Aln', aln)
print '{}\t{}'.format('Target', target)
ADD COMMENTlink written 3.3 years ago by st.ph.n2.5k

Thanks for this try, but it may not handle gaps in a sequence alignment.

ADD REPLYlink written 3.3 years ago by genebow160
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1323 users visited in the last hour