How to display mismatched sequences from alignment when using Biopython
2
0
Entering edit mode
4.5 years ago
genebow ▴ 160

When using Biopython alignment to align two protein sequences, the output does not show the mismatched amino acids. Would you please have some suggestions to display the mismatched amino acids by Biopython?

The following is the code for alignment by Biopython. I tried different options, the problem could not be resolved.

----------------------------------------------------------

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

proteinSeqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
proteinSeqseqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

alignments = pairwise2.align.localxx(proteinSeqQuery,proteinSeqseqFound)
print('Score',alignments[0][2])
print(format_alignment(*alignments[0]))


----------------------------------------------------------------

The program output and the output from EMBI-EBI EMBOSS Water (Protein Alignment) are shown as follows:

Biopython:
HRTLGLLLHTQVSI-QQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HRTLGLLLHTQVSIKQ-LLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK

EMBL-EBI:
EMBOSS_001         1 HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYW     50
||||||||||||||:||||||||||||||||||.||||||||||||||||
EMBOSS_001         1 HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYW     50

EMBOSS_001        51 KLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK     82
|||||||||||||||||||||||||||||:||
EMBOSS_001        51 KLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK     82


and in this picture: output

alignment biopython • 5.8k views
1
Entering edit mode

I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

0
Entering edit mode

Thanks! I will use this approach for formatting code.

0
Entering edit mode

Please say which version of Biopython you are using, since the pairwise code was rewritten in Biopython 1.68

2
Entering edit mode
4.5 years ago
Joe 19k

Here's a bit of a bodge that might work. Note though that it doesn't take in to consideration amino acid similarity. It would be easy enough to code in with a few more elif statements though provided you can put together a table of what matches with what.

In this instance it just leaves a blank where there's no match, by simply iterating both strings from pairwise2 and checking equivalency. You could create a slightly fancier print output too if desired but I think this works.

In essence all you need to do is create a custom string printed in the middle of the 2 alignments. It's generated by appending to a list, and then joining the list to make another string. Standard use of monospaced fonts in terminals etc means that the alignments of all the character columns is taken care of for you.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

proteinSeqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
proteinSeqseqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

alignments = pairwise2.align.localxx(proteinSeqQuery,proteinSeqseqFound)

match = []

for a, b in zip(alignments[0][0],alignments[0][1]):
if a == b:
match.append('|')
else:
match.append(' ')

print(alignments[0][0])
print("".join(match))
print(alignments[0][1])


Yields:

HRTLGLLLHTQVSIQ-QLLKLPAECFHPKPKVNSV-LIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKH-AK
||||||||||||||  ||||||||||||||||||  |||||||||||||||||||||||||||||||||||||||||||||  ||
HRTLGLLLHTQVSI-KQLLKLPAECFHPKPKVNS-ALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMK-YAK

0
Entering edit mode

It works! Thank you very much!

0
Entering edit mode

Great, if this answers your question satisfactorily, be sure to accept the answer (tick on the left hand side of the post).

0
Entering edit mode

Also I'd maybe play around with the gap penalty and mismatch penalty a little bit since that alignment seems to be scoring gaps quite low.

For instance, it looks to me like the 2 gaps here should instead just be a mismatch, but I could be wrong...

   QVSIQ-QLLKL
||||  |||||
QVSI-KQLLKL


Would make more sense to me as:

   QVSIQQLLKL
|||| |||||
QVSIKQLLKL

1
Entering edit mode

Thank you! yes, that is a problem. What gap penalty do you use in localxx() for the correct result (second one)? It looks localms() function should be used. The following is the improved version.

from Bio import pairwise2

seqQuery='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
seqFound='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

# Four parameters (2,-1,-1,-1) in alignment:Identical characters are given 2 points, 1 point is deducted for each non-identical character.
# 1 point is deducted when opening a gap, and 1 point is deducted when extending it.

alignments = pairwise2.align.localms(seqQuery,seqFound, 2, -1, -1, -1)

match = []

for a, b in zip(alignments[0][0],alignments[0][1]):
if a == b:
match.append('|')
else:
match.append('*')

m="".join(match)
s=[]
s.append(alignments[0][0]+'\n')
s.append(m+'\n')
s.append(alignments[0][1])

alignedSeqs="".join(s)
print('\n')
print(alignedSeqs)


The output is as follows:

HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK
||||||||||||||*||||||||||||||||||*|||||||||||||||||||||||||||||||||||||||||||||*||
HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK

1
Entering edit mode

Yep that looks to be more reasonable. Opening a gap is usually the most heavily penalised action AFAIK, so that makes for a much more parsimonious/high scoring alignment.

0
Entering edit mode

I didn't, that was just a mockup I did by hand from the sequence. I think the default gap penalty for BLAST is -10 or -6. Maybe have a look around at what other common parameters use and try that?

0
Entering edit mode

Thanks! There are a lot of confusions in the alignment functions in Biopython. I currently use Biopython v1.67, and will upgrade to v1.68.

1
Entering edit mode

The current release is Biopython 1.70, but please do try upgrading since the pairwise code was rewritten in Biopython 1.68

1
Entering edit mode

Have a look in Biopython's API. pairwise2 is complex and allows many options. To get the same results as in EMBOSS' WATER you should use the same parameters, e.g. using align.localds(seq1, seq2, matrix, gap-open-penalty, gap-extend-penalty). In the API is an example how to implement the matrix.

0
Entering edit mode

Is there a way to make this faster? It seems really slow right now.

0
Entering edit mode

If you’re looking for speed you’re probably not looking for Biopython.

0
Entering edit mode
4.5 years ago
st.ph.n ★ 2.6k
#!/usr/bin/env python

query='HRTLGLLLHTQVSIQQLLKLPAECFHPKPKVNSVLIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKHAK'
target='HRTLGLLLHTQVSIKQLLKLPAECFHPKPKVNSALIKLTRHTTDVPDKYWKLYTYFVSKWVNREYRQLFTKNQFHQAMKYAK'

aln = ''

for n in range(len(query)):
if query[n] == target[n]:
aln += '.'
else:
aln += query[n]

print '{}\t{}'.format('Query', query)
print '{}\t{}'.format('Aln', aln)
print '{}\t{}'.format('Target', target)

0
Entering edit mode

Thanks for this try, but it may not handle gaps in a sequence alignment.