Question: Biopython pairwise sequence alignment behavior change between 1.59 and 1.76
0
gravatar for sanner
23 days ago by
sanner20
sanner20 wrote:

Hello

I am in the process of porting code from Python 2.7 to 3.7 and I am running into some differences in BioPython

The following 3 lines of code:

from Bio import pairwise2
aln = pairwise2.align.globalxx('EPQYEEIPIYL', 'EPQ?EEIPIYL', one_alignment_only=True)[0]
print(aln)

on Python 2.7 with Bio v 1.59 yield:

('EPQYEEIPIYL', 'EPQ?EEIPIYL', 10.0, 0, 11)

while on 3.7 (using miniconda, Bio v 1.76) I get

('EPQY-EEIPIYL', 'EPQ-?EEIPIYL', 10.0, 0, 12)

creating a spurious gap in the alignment

Thanks for any help with that

-Michel

ADD COMMENTlink modified 13 days ago • written 23 days ago by sanner20

not sure what is your question, but seems it's an issue to report to BioPython developers

ADD REPLYlink written 23 days ago by JC11k

1.59 is pretty ancient, and the entire pairwise2 module was overhauled in, I think, 1.7. I'm not sure how it handles ? Characters now, have you tried without the ?. I vaguely recall possibly needing to specify the ambiguous character.

ADD REPLYlink modified 22 days ago • written 22 days ago by Joe18k

Thanks Joe I replaced ? by X which should stand for 'ANY' in the FASTA format but I get the same result.

JC, My issue is that I expect pairwise2.align.globalxx('EPQYEEIPIYL', 'EPQXEEIPIYL') to return a 1 to one mapping with Y mapped to X as it use to do in 1.7 instead of creating a insertion in both chains to avoid this mapping from occuring.

-Michel

ADD REPLYlink written 21 days ago by sanner20

You could try updating again to 1.8. I haven't looked at changelogs, but perhaps something got broken in 1.78. Otherwise this is probably an issue to report on their GitHub

ADD REPLYlink written 20 days ago by Joe18k
1

I tried with 1.9 and get the same behavior so I did submit an issue to github. Thanks

ADD REPLYlink written 13 days ago by sanner20
1
gravatar for sanner
13 days ago by
sanner20
sanner20 wrote:

Markus Piotrowski gave a nice explanation and solution

The aim of a pairwise alignment algorithm is to achieve an alignment with the best score (or the least costs). From the view of the algorithm (here Needleman-Wunsch-Gotoh) the situation "one gap in seqA followed by one gap in seqB" is not forbidden. The fact that the older implementation of this algorithm in Biopython's pairwise2 did not allow this, was a shortcoming and was criticized e.g. in this paper. That was the reason for re-writing pairwise2 for Biopython 1.68 (in 2016) (with the side effect, however, unrelated to the original issue, of increased performance regarding speed and memory usage). As you see, the alignment with the "spurious insertion" in your example has the same score as the other one, and in fact, the other alignment is also produced (would have index [1]).

This behavior is documented in the docstring of the module which can be read in our API documentation, in detail:

Depending on the penalties, a gap in one sequence may be followed by a gap in the other sequence. If you don’t like this behaviour, increase the gap-open penalty:

>>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1):
    print(format_alignment(*a))

A-

-T
  Score=-2

>>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1):
    print(format_alignment(*a))
A
.
T
  Score=-4

This also explains how you can prevent this: by increasing the gap penalty (two gaps should be more expensive than one mismatch). In fact, that is what most "standard settings" in alignment programs do.

Your example shows an amino acid alignment and I doubt that globalxx comes close to meaningful alignments in such cases (with 'real' proteins). You might consider using a substitution matrix:

>>> from Bio import pairwise2 as pw
>>> from Bio.Align import substitution_matrices

Warning (from warnings module): ...

>>> blosum62 = substitution_matrices.load("BLOSUM62")
>>> aln = pw.align.globalds("EPQYEEIPIYL", "EPQ*EEIPIYL", blosum62, -10, -0.5)[0]  # replacing your ? with a * which is in the matrix
>>> print(pw.format_alignment(*aln))
EPQYEEIPIYL
|||.|||||||
EPQ*EEIPIYL
  Score=49

Alternatively, if you want to keep the simple matching scheme, use globalms to set different gap parameters:

>>> aln = pw.align.globalms("EPQYEEIPIYL", "EPQ?EEIPIYL", 1, -1, -2, -.5)[0]
# identical to: aln = pw.align.globalms("EPQYEEIPIYL", "EPQ?EEIPIYL", match=1, mismatch=-1, open=-2, extend=-.5)[0]
>>> print(pw.format_alignment(*aln))
EPQYEEIPIYL
|||.|||||||
EPQ?EEIPIYL
  Score=9

Finally, if you are doing many pairwise alignments, you should consider using our new PairwiseAligner which has a much better performance and other advantages (the latter because it is implemented as a class).

ADD COMMENTlink modified 13 days ago by genomax91k • written 13 days ago by sanner20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1690 users visited in the last hour