Forum:Asking for feedback on a Python library for computing alignments
2
3
Entering edit mode
4 months ago

Hello there, I'm the author of the Python library "pyalign" for computing alignments, and I would be interested in getting feedback from experts on alignment algorithms in terms of the design of the library (since I'm not an expert myself).

In short, I would be interested in learning whether there are any glaring mistakes/omissions in the library's public API.

The library is https://github.com/poke1024/pyalign. It supports global, local and semiglobal alignments, various gap costs and implements the usual classical algorithms (Smith-Waterman, Needleman-Wunsch, Gotoh for affine alignments). The computations are done in a C++17 backend, that heavily relies of templates for code optimization. In terms of single-thread CPU performance (i.e. not comparing to more advanced SIMD or GPU implementations) it should be pretty fast.

Note that the library was not built for a bioinformatics context, but for users who need versatile alignment algorithms for other domains.

A more in-detail notebook is available under https://mybinder.org/v2/gh/poke1024/pyalign-demo/HEAD?filepath=example.ipynb

In functionality, the library is mostly similar to https://biopython.org/docs/1.75/api/Bio.pairwise2.html

One special feature of pyalign (and the main reason for its existence) is that it can deal with large alphabets, like millions of different letters.

alignment Python • 1.0k views
5
Entering edit mode
4 months ago
Mensur Dlakic ★ 18k

Good job with this package. I will definitely use it in teaching, and it will likely be helpful to the community. There are many questions on this forum about local/global pairwise alignments.

I like the matrix visualization option. Still, the jupiter output is weird.

INDU    STRY
||      ||
IN  TEREST


This does not pass the eye test as to how global alignments should look like. Instead:

INDU----STRY
||      ||
IN--TEREST--


Now, even though the alignment above and the one below are for practical purposes equivalent, based on the traceback matrix in your GitHub page the correct jupiter output should be:

IN----DUSTRY
||      ||
INTERE--ST--


In your benchmark graphs, figure legends are covering up relevant parts of the plot. Manually setting the Y-axis to larger values would open up some space so that the legend can fit. Or make the legend letters smaller.

0
Entering edit mode

Hello, thank you for the great feedback. I've changed the code for alignment output to the format you describe (which came down to flipping the way I read the traceback matrix) - this is now also reflected in the README example. I really appreciate getting this kind of information. The graphs are now formatted differently, so the bars are no longer hidden.

1
Entering edit mode
4 months ago

I want to point out another implementation that has Python bindings, parasail:

https://github.com/jeffdaily/parasail

### Installation

Information is severely lacking, I had to figure out myself what needs to be installed using mamba, it was quite annoying to get it to compile, the setup.py is not listing and reporting the dependencies.

### On benchmarking

When you benchmark you should compare to the BioPython pairwise aligner. Rarely (if ever) would one use a pure Python aligner. As it turns out the BioPython implementation is quite performant - is available on Windows as well, and in my evaluation, it was faster than parasail - a library optimized for speed!

It is a high bar to pass! Kudos to the BioPython folks (Michiel de Hoon specifically if I am not mistaken)!

### Tacit limitations, unexpected resource usage

I tried to run pyalign to see if it was suitable for aligning SARS-COV-2 genomes and got this:

RuntimeError: requested maximum length 32768 for s exceeds maximum supported sequence length in this implementation 16383


I think a limitation of 16K is too low for most usecases. But then I tried to cut the sequence down to under 16K and I still got the error above???

The longest sequence I could align was 8000K took about 8 seconds and seems to use 12 gigabytes of memory seems to have a hard time even releasing all that memory, the process hangs a bit at the end.

Notably, however, the runtime is much slower than the same alignment produced via the pairwise function in BioPython (less than 2 seconds) with no noticeable memory usage.

0
Entering edit mode

Thank you for taking the time to look at this and try this out.

I'm sorry for the installation trouble. I've added a new section to the README that details the conda requirements (xtensor mainly). I also hope that the pip install works on all platforms.

I've added some basic benchmarks for biopython's pairwise2 and parasail. Parasail is basically blazing away in my random test cases by a factor of 10. From what I see in the code, it does not even allocate a full matrix, but only uses two rows (columns?), which gives it much better cache coherence? (just an assumption). Anyway, totally impressive, given that I used the standard (non-SIMD) case (parasail.sw).

Note that pyalign is not trying to compete in speed with established bioinformatics libraries, which would be a vain endeavour indeed. It's more about having versatile options for alphabets and scores. For example, parasail only supports byte strings, thus limiting the alphabet size to 256. Similarly, scores are often limited to integers.

I've changed the implementation in pyalign to support larger sequences (basically 32-bit indices now). I'm at a loss currently to explain what caused the 12 gigabytes memory hog you experienced - there is currently an O(nm) allocation for the matrix, which can go up to 4nm for optimization purposes, but that should be it. I will need to look into dealing with large sequences more in the coming weeks. I cannot explain why it was considerably slower than biopython. For the toy examples I came up with (seq length about 1000), pyalign ran in roughly the same performance magnitude as biopython.

Thank you again for trying this out and giving feedback. This is valuable.

0
Entering edit mode

I would suggest getting a longer sequence and testing with that, since SARS-COV-2 is so heavily studied (but still quite small) 29kb you could start with that.

https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta

I think your testing is not quite right, the benchmark problem is way too simple. When an alignment takes microseconds to complete it is not the alignment library that gets tested, more like the speed of the bindings to the library, or some other overhead.

In my benchmarks when aligning two genomes of SARS-COV2 the Biopython.pairwise2 was substantially faster than both pyalign and parasail.

0
Entering edit mode

Using (parts of) the SARS-COV-2 genome, I put together a little notebook to produce some timings:

If I'm doing something wrong there, I would be grateful to get an idea what it is.

# runtimes

I limited the alignment to the first 5000 bases vs. the following 5000 bases (see cell 4), because I was not able to run a larger alignment with biopython (and pyalign) in reasonable (< 1m) time. Here are the numbers from the notebook:

test_parasail time: 0.1446s score: 2089
test_biopython time: 13.4976s score: 2089.0
test_pyalign_general time: 20.8568s score: 2089.0
test_pyalign_alphabet time: 3.0250s score: 2089.0


All four solutions correctly compute the score.

# general vs alphabetic mode in pyalign

The runtime data you saw with pyalign is probably from the "general" mode. There are two different modes.

Mode A (pyalign.problems.general) assumes that there is an infinitely sized alphabet in the sequences, and that for two sequences a and b, the scores between a[i] and b[j] are in no way structured (i.e. based on a small score matrix).

Mode B (pyalign.problem.alphabetic ) corresponds to nucletoide alignments, where one can assume a small alphabet of finite size. The runtimes there are much better.

0
Entering edit mode

Thanks for following up with the code. I will use your test code below.

I actually misspoke and agreed that pairwise2 is the right way to generate alignments in BioPython.

Actually, pairswise2 is deprecated and should not be used.

The efficient way to use alignments is via:

from Bio.Align import PairwiseAligner


in that case the code below:

import parasail
from Bio import pairwise2
from Bio.Align import PairwiseAligner
import time

a = "ACGT" * 1000

def measure_time(f):
t0 = time.time()
r = f()
t1 = time.time()
print(f.__name__, f"time: {t1 - t0:.4f}s", f"score: {r}")

def test_parasail():
matrix = parasail.matrix_create("ACGT", 1, 0)
return parasail.sw(a, a, 1, 1, matrix).score

def test_biopython_1():
score = pairwise2.align.localxs(
a, a, open=-1, extend=-1, one_alignment_only=True)[0].score
return score

def test_biopython_2():
aligner = PairwiseAligner()
alns = aligner.align(a, a)
return alns[0].score

measure_time(test_biopython_1)
measure_time(test_parasail)
measure_time(test_biopython_2)


will produce:

test_biopython_1 time: 5.7560s score: 4000.0
test_parasail time: 0.0671s score: 4000
test_biopython_2 time: 0.0580s score: 4000.0


note how biopython is faster than parasail.

Parasail most likely could be tuned to be faster than biopython, it has many options that I don't fully understand. I needed a generic aligner that worked in different circumstances for both global and local alignments, with both nucleotides and peptides, with a tunable penalties even for gaps at start and end, etc

It seemed that BioPython did a good job right away with no special tuning needed.

0
Entering edit mode

Oh, now I remember the problem with parasail, if one needs a trace (actual alignment, not just the score) parasail was slower than biopython.

0
Entering edit mode

Ok, thanks for sharing, these are interesting timings. And thanks for hinting me to PairwiseAligner, I wasn't aware of it.