Question: Truly Parallel Blasts With Blast+
Eric Normandeau wrote, 6.3 years ago (Quebec, Canada):

Hi,

I find myself once again having to run blast+ programs on large numbers of sequences (100,000+) against swissprot, refseq, nr, etc.

blast+ can use multiple cores, but the way it is implemented means that cores that finish their share of the calculation early have to wait for the longest-running one before getting new work. On 8-16 cores, this leaves the processors idle more than two thirds of the time.

I am thinking of implementing something to solve this: a master program would call blast on single sequences, launch new blasts as processors become free, and keep track of the order in which the results should be combined. Since the database is cached the first time it is used, I think the overhead of launching a new blast command each time should be minimal. Dealing with some of the output formats may be somewhat painful (format 0, for example), but for tabular formats (e.g. 6) this should be pretty trivial.
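The dispatch-and-reorder scheme described above can be sketched in a few lines of Python with multiprocessing.Pool.imap, which hands a new record to whichever worker frees up first while still yielding results in input order. run_blast here is a hypothetical stand-in for a subprocess call to blastp on one sequence; all names are illustrative, not from the thread.

```python
from multiprocessing import Pool

def run_blast(record):
    # Hypothetical stand-in for something like:
    #   subprocess.run(["blastp", "-db", "swissprot", "-outfmt", "6",
    #                   "-query", "-"], input=record, capture_output=True)
    header = record.splitlines()[0]
    return header + "\thit"  # fake tabular (outfmt 6) result line

def parallel_blast(records, ncpus=8):
    # imap() keeps each worker busy independently, but yields results
    # in the original input order, so the output needs no re-sorting
    with Pool(ncpus) as pool:
        return list(pool.imap(run_blast, records))

if __name__ == "__main__":
    records = [">seq{}\nMKVL".format(i) for i in range(4)]
    for line in parallel_blast(records, ncpus=2):
        print(line)
```

The real work would happen in run_blast; the ordering guarantee comes for free from imap.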

The only alternative I have at the moment is to split the sequence file into n smaller files and run n blasts. I would like a solution where all that is needed is to launch a single command, for example:

parallel_blast.py blastp [options] input_file

Before I do that, however, I would like to know if anybody is aware of an existing solution to this problem. I browsed the other blast posts on the forum and am already following most of the suggestions there. This is not a very difficult problem, but if somebody has already produced a quality solution, I'd rather use it than hack together something ugly.

Tags: parallel, blast+
modified 5.5 years ago by Jeremy Leipzig • written 6.3 years ago by Eric Normandeau

How about this: GNU Parallel - parallelize serial command line programs without changing them. IMO it is the best tutorial on Biostar at the moment, and it also shows how to split records.

written 6.3 years ago by Michael Dondrup

Thank you Michael. I didn't remember there being a blast example in there. I will certainly try it. Do you want to add this comment as an answer? If it works well for me, I'll mark it as the accepted answer.

written 6.3 years ago by Eric Normandeau
Michael Dondrup wrote, 6.3 years ago (Bergen, Norway):

According to this tutorial, GNU Parallel - parallelize serial command line programs without changing them, the examples under "Running Blast on multiple machines" should do what you want. I haven't tested it myself, but the source looks very good.

In particular:

  cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

should do what you want. The block size might be something to experiment with: invoking a blast process for every single sequence, or only a few sequences, might incur too much overhead for program start-up and database loading.

I do not know how good the error handling is, but it is important to keep track of crashing processes, which can still happen with blast+. Another issue is that you will get multiple Blast headers if you use the default output format.

modified 6.3 years ago • written 6.3 years ago by Michael Dondrup

Finally had time to try it. Works like a charm. It does output multiple headers, though: one per block processed. For some of the output formats this makes no difference, and for the others it does not matter for what I want to do. Thanks!

modified 6.3 years ago • written 6.3 years ago by Eric Normandeau

parallel has a --halt-on-error (or just --halt) option, which controls how errors in the spawned processes are handled; it could be useful for catching crashed blast runs. The default error handling reports the number of processes that exited with an error as the exit status of parallel itself.

written 5.4 years ago by Thomas Sibley

To deal with the multiple output headers, you can use something like csplit, or chunk the output into separate files rather than sending everything to stdout:

parallel --block 100k --recstart '>' --pipe blastn -query - -html ... \
    < 1gb.fasta \
    | csplit -f results- -b %03d.html - '%^<HTML%' '/^<HTML/' {*}

parallel --recstart '>' -N 1 --pipe blastn -query - -out 'result-{#}.html' -html ... \
    < 100-whole-genomes.fasta

The -N option lets you specify exactly how many records to pipe to each process, instead of chunking by size. It's documented to be a little slower (understandably), but I don't think the difference is appreciable compared to the runtime of blast itself.

written 5.4 years ago by Thomas Sibley

Thomas, thanks for the -N option. More logical than using chunk sizes in this case.

written 5.2 years ago by Eric Normandeau
Manu Prestat wrote, 6.3 years ago (Marseille, France):

Hi Eric, I know a lot of people (including me) who have their own in-house solution (using MPI, in my case). There is also a solution on the market, http://www.paracel.com/, but as it is "truly" on the market, it has a cost ;-) You may also want to take a look at MPI-BLAST.

EDIT: I've not looked into MPI-BLAST for a while, but as far as I remember it can split the query AND the database, and then recompute a relevant e-value (which depends on the db size). My version is written in Python using the mpi4py library. I split only the query, distribute as many jobs as there are "sub-fasta" files, and gather the output into one file at the end. It works only with the tabular output and is adapted to the cluster I use. However, I think it is easy to understand and to modify for your own use if you wish. I have shared it here: https://code.google.com/p/bioman/source/browse/mpiBlastHopper.py . It was not intended to be public, so it's not very clean ;-)

I run it like this (aprun is like mpirun on that specific cluster):

  aprun -n 480 -N 24 mpiBlastHopper.py file.faa mydb bin/blastp "-evalue 1e-4 -num_alignments 5 -num_descriptions 5"

modified 6.3 years ago • written 6.3 years ago by Manu Prestat

Hi Manu. While I would not consider paying for blast capabilities, MPI-Blast looks just like what I was looking for. However, development seems to have halted around 2007, and their website now points to a company. Do you have any experience, comments or opinions about MPI-Blast? Would you mind describing your own in-house MPI solution and giving examples of how to run a blast with it?

modified 6.3 years ago • written 6.3 years ago by Eric Normandeau

OK, I edited my post.

written 6.3 years ago by Manu Prestat
xapple wrote, 5.6 years ago (UU):

Here is my in-house solution using Python and a SLURM cluster if anyone is interested:

https://github.com/limno/parallelblast

written 5.6 years ago by xapple
Jeremy Leipzig wrote, 5.5 years ago (Philadelphia, PA):

I run a lot of pairwise blasts for mitomaster, and I wrote this Node.js wrapper to allow them to run on their own server:

https://www.npmjs.org/package/blast-wrapper

https://github.com/leipzig/blast-wrapper

modified 5.5 years ago • written 5.5 years ago by Jeremy Leipzig
Zag wrote, 5.5 years ago (European Union):

A simple solution could also be to split your input and run blast with xargs.

Assuming you want to run on 8 cores and your input files are splitted_0.fasta, splitted_1.fasta ... splitted_7.fasta, you can run

seq 0 7 | xargs -P 8 -I {} blastn -task megablast -query splitted_{}.fasta -db your_database -out tmp_{}.tsv
cat tmp_*.tsv > tmp.tsv
seq 0 7 | xargs -I {} rm splitted_{}.fasta tmp_{}.tsv

(The -P 8 flag is what makes xargs run up to 8 jobs at once; without it, xargs -I runs the commands one at a time.)

See How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File for how to split a fasta into multiple files.
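For reference, a round-robin splitter needs only a few lines of stdlib Python; giving every n-th record to the same chunk also spreads long and short sequences fairly evenly across the files. (split_fasta and the sample records below are illustrative, not from the linked post.)

```python
def split_fasta(text, n=8):
    # Round-robin split of FASTA text into n chunks: record i goes
    # to chunk i % n, so chunk sizes stay balanced.
    chunks = [[] for _ in range(n)]
    rec = -1
    for line in text.splitlines():
        if line.startswith(">"):
            rec += 1                      # a new record starts here
        chunks[rec % n].append(line)
    return ["\n".join(c) for c in chunks]

if __name__ == "__main__":
    fasta = ">a\nACGT\n>b\nGGTT\n>c\nTTAA"
    for i, chunk in enumerate(split_fasta(fasta, n=2)):
        print("--- chunk %d ---" % i)
        print(chunk)
```

Writing each returned chunk to its own splitted_N.fasta file then feeds straight into the xargs command above.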

written 5.5 years ago by Zag
dmathog wrote, 3 months ago:

Bumping an old thread...

For large multi-CPU machines, this script might do what the OP wants:

http://saf.bio.caltech.edu/pub/software/molbio/many_blast_plus_1cpu.sh

syntax:

  many_blast_plus_1cpu.sh blast_program infile outfile Ncpus blast_params

On machines with more than 40 threads the script was about twice as fast as running the programs directly with -num_threads set to the same number as Ncpus.

Also, I recently updated my blastmerge program to blastplustmerge, so it will now work with (slightly modified) blast 2.9.0+. This is a reimplementation of the methods described here:

https://academic.oup.com/bioinformatics/article/19/14/1865/246052

for the current NCBI programs. It is used after running queries against split databases on compute nodes; it reassembles the results into what they would have been had the query been run against the whole database at once. Only -fmt 0 is supported. This program and a bunch of related scripts (for setting up and splitting the databases, and running the blast jobs) are here:

https://sourceforge.net/projects/parallelblast-plus/

Relatively small nodes may be used, since only 1/Nth of the database is needed on each compute node. A web interface is also provided, but I would suggest restricting it to internal use.

Patches for my method of taxonomy-restricted searches, which use two external files per database, are included; these work with either v4 or v5 NCBI blast databases. This method supports internal taxonomy nodes ("rodentia" rather than a list of all rodent taxonomy IDs). Binaries (CentOS 7) built with those patches are also available from the URL above.

In addition to the taxonomy restriction, a method to force in OID ranges was added that does not require the creation of NAL/PAL files. If compute nodes have enough disk/memory to hold large databases, those may be effectively sliced by setting OID ranges like 1,N on the first node, N+1,2N on the second, and so forth. That would eliminate the need to split the databases physically, but the order of the database contents should probably be randomized first (to avoid blocks of related sequences from a single organism, which may all have been added to the database at the same time).
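The OID slicing described above boils down to ceiling-division arithmetic; here is a hedged sketch of how such 1-based, inclusive ranges could be computed (the function name and conventions are mine, not from the tool):

```python
def oid_ranges(total, nodes):
    # Slice `total` sequences into 1-based, inclusive OID ranges,
    # one per compute node: 1..N, N+1..2N, and so forth.
    size = -(-total // nodes)  # ceiling division
    return [(i * size + 1, min((i + 1) * size, total))
            for i in range(nodes) if i * size < total]

print(oid_ranges(10, 3))  # → [(1, 4), (5, 8), (9, 10)]
```

Each tuple would become the OID range handed to one node's blast invocation.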

written 3 months ago by dmathog