Massive Pairwise Comparison Using Biopython
3
1
Entering edit mode
10.4 years ago
Peter ▴ 10

Hi,

I have a data-set of ~7500 sequences, avg. length ~1700 bases. I need to perform pairwise analysis on the entire set. I have a biopython script to perform this analysis in parallel. My understanding is that the comparison will need to run on an MPI cluster. What are my options for doing this and where could I run the job?

Thanks,

Peter

biopython pairwise • 10k views
ADD COMMENT
3
Entering edit mode

What tool do you want to use to do the comparisons, or is it something you have written yourself?

ADD REPLY
0
Entering edit mode

Hi Peter, I actually work on exactly the same problem. I use mpi4py library and perform pairwise alignment using muscle, clustalw or mafft. My code runs on the cluster, though 2000 sequences takes about 5 hours on 40 cores. Yes you can also use pairwise2 from biopython. If you are still interested please me an email at mubrejnev@gmail.com, so that we can chat about it. Brejnev

ADD REPLY
5
Entering edit mode
10.1 years ago
lh3 32k

As others have commented, you did not mention what types of "pairwise comparison" you are doing [EDIT: just notice that this is a very old question. Anyway...]. I do not know biopython enough about if the part you use is based on native machine code or written in Python. If for your task a pure python module takes most of CPU time, you should seek solutions in C/C++ or Java, which can be 10-100X faster for compute intensive tasks such as Smith-Waterman alignment. If you are actually using Smith-Waterman for comparison, SSE2-based implementation can give you a further 30X speed up, or >1000X speed up over a pure python implementation.

Think about that: you can use your laptop to finish a task that requires a moderate cluster.

EDIT: To support my claim, I did a naive benchmark. The take-home message is biopython might be 10000X slower than a good SW implementation. Sometimes choosing the right algorithm and/or implementation can make a huge difference.

The detailed result is here: [?] Input: human mitochondrial DNA (MT.fa) vs. 2000 150bp reads simulated from it (q.fa) =================================================================================================== Program Time (s) Mem (MB) Command-line


cross_match 0.28 104.280 cross_match q.fa MT.fa -matrix mat-blast # [not SW; time/2] ksw-s1 1.25 0.720 ksw -fs1 MT.fa q.fa # [no aln; query up to 255bp] ksw-s2 2.05 0.720 ksw -fs2 MT.fa q.fa # [SSE2; no aln; 2nd best] exonerate 4.64 14.736 exonerate -m affine:local -S n -d EDNABLAST -o -7 -e -2 q.fa MT.fa # [not SW; time/2] bwa-stdsw 29.66 0.976 bwa stdsw -f MT.fa q.fa # [buggy in rare cases] swat 45.58 5.264 swat q.fa MT.fa -M mat-blast -N 1 -raw # [custom matrix] water 144.52 24.696 water -as MT.fa -bs q.fa -da EDNABLAST -gapo 7 -gape 2 -ou 1 biopython ~22752.8 ~620.784 # [able to output multi opt aln but disabled] =================================================================================================== [?]

Here are some comments on each implementation:

  • cross_match from phrap-080730. Cross_match is not a standard SW implementation, but for intra-species alignment, it should give identical alignment to SW most of time. Cross_match aligns both strands. I halve its CPU time as others are configured to align the forward strand only. Cross_match also finds all the high-score hits, while others do not.

  • swat from phrap. This is largely believed to be the most efficient SW implementation before the invention of SSE2-based ones.

  • exonerate, non-exhaustive mode (not a standard SW).

  • bwa-stdsw. This is a component of BWA. It uses tricks inspired by swat. Stdsw implements a linear-space SW, but I think it is still buggy in very rare cases. The alignment may be slightly different from the optimal. This is actually the first data set where stdsw shows better performance than swat; swat is always faster for other data sets.

  • water from EMBOSS.

  • Biopython. By default, this implementation does more than others in this benchmark in that it finds all the optimal alignments having the same alignment score. I switched off this feature. The core routines are implemented in C so far as I can tell.

  • ksw. My experimental SSE2-based implementation. It computes score, the end coordinates of the optimal alignment and the 2nd best score. It does not give the final alignment or the start coordinates.

Biopython script

from Bio import pairwise2
from Bio import SeqIO
import sys
for t in SeqIO.parse(sys.argv[1], "fasta"):
    for q in SeqIO.parse(sys.argv[2], "fasta"):
        a = pairwise2.align.localms(t.seq, q.seq, 1, -3, -7, -2, one_alignment_only=1, score_only=1)
        print t.id, "\t", q.id, "\t", a
ADD COMMENT
1
Entering edit mode

Doing pairwise alignment within Biopython does not necessarily mean that it really has to use the pairwise2 module. Perhaps you know that water and needle (from EMBOSS) can be run from within Biopython.

However, I think you've presented a nice benchmark here. I wonder if there is a significant overhead for running water from Biopython.

ADD REPLY
1
Entering edit mode

If Biopython is able to call exonerate or cross_match, that would be preferred.

ADD REPLY
0
Entering edit mode

There is also diagonalsw, an SSE4.1 open source implementation of Smith-Waterman. http://diagonalsw.sourceforge.net/

ADD REPLY
0
Entering edit mode

Thanks. I have also found SWPS3 is buggy.

ADD REPLY
0
Entering edit mode

Hello, I have similar thing to do and i was trying this script. The thing is it outputs all scores for one to the other, for instance

gi|321422410|gb|ADIH01013800.1| gi|321371556|gb|AEAQ01026353.1| 245.0 gi|321422410|gb|ADIH01013800.1| gi|321372252|gb|AEAQ01025657.1|1 266.0 gi|321422410|gb|ADIH01013800.1| gi|321393004|gb|AEAQ01005223.1| 307.0 gi|321422409|gb|ADIH01013801.1| gi|321371556|gb|AEAQ01026353.1| 266.0 gi|321422409|gb|ADIH01013801.1| gi|321372252|gb|AEAQ01025657.1|1 299.0 gi|321422409|gb|ADIH01013801.1| gi|321393004|gb|AEAQ01005223.1| 356.0

but i only want it to output the best score for that one to the other, like for insance

gi|321422410|gb|ADIH01013800.1| gi|321393004|gb|AEAQ01005223.1| 307.0 gi|321422409|gb|ADIH01013801.1| gi|321393004|gb|AEAQ01005223.1| 356.0

is there something i could add to get it to do this. i looked up pairwise2 but thing.? thanks

ADD REPLY
1
Entering edit mode
10.4 years ago
Adam ▴ 10

Amazon offers cluster computing services, here

It can get a little costly, so plan carefully. Good luck.

ADD COMMENT
1
Entering edit mode

Depending on what "pairwise analysis" is, this might not be needed. Anyhow, it's hard to judge from the question.

ADD REPLY
0
Entering edit mode
10.1 years ago
Joemar ▴ 10

I have done something like this using PiCloud. I ran pairwise global alignment using the needle program in Emboss (controlled within Biopython) with about a million pairs of protein sequences.

Given a working script and a preconfigured environment to run it, you can invoke their cloud.map() function to achieve massive parallelization.

You may still have to adapt your function a little bit and you have to familiarize yourself with the system but it is a lot less hassle compared to doing this directly in Amazon EC2.

You may want to check them out: http://www.picloud.com/ and http://docs.picloud.com/

ADD COMMENT

Login before adding your answer.

Traffic: 2771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6