Correct Method To Blast All-Vs-All With Ncbiblast & How To Speed It Up?
13.4 years ago
Tim ▴ 350

Hi all,

I'm using ncbi-blast-2.2.24+ (on Ubuntu Linux) for a sizable all-vs-all BLAST of protein sequences (530,000 lines of FASTA). This has already been running for quite a while (over an hour), so I'm looking into ways to speed it up.

What I've done is run:

ncbi-blast-2.2.24+/bin/makeblastdb -in good_proteins.fasta -dbtype prot -out my_prot_blast_db

followed by:

ncbi-blast-2.2.24+/bin/blastp -db my_prot_blast_db -query good_proteins.fasta -outfmt 6 -out all-vs-all.tsv -num_threads 4

Now firstly: Is this the correct way to do an all-vs-all blast?

And secondly: How can I speed this up?
I added -num_threads 4 in the hope of making it use all four of my processing cores, but it just alternates between them, using 100% of one CPU while the other three sit nearly idle. (Being a CS graduate, I'm aware of the distinction between cores and threads, but I didn't see any other configuration option that seemed related.)

Possibly thirdly: Is it at all reasonable to expect this all-vs-all BLAST on such a dataset to run in a manageable amount of time, or should I somehow divide this up / move to supercomputers?

(And maybe fourthly: I just chose ncbi-blast because I thought it'd be a good choice. Would any other tool handle this case better?)

Best regards, Tim

blast blast • 28k views
13.4 years ago
Neilfws 49k
  1. Yes, that looks good for all-v-all BLAST; your command-line arguments look fine.
  2. There is some discussion of the num_threads issue over at SEQanswers. A comment there suggests that only part of the BLAST+ procedure (word match) is multi-threaded. Someone else suggests that this is an issue with BLAST+ as opposed to the "older" BLAST. That rings true to me: I have not used BLAST+, but I recall that with the old blastall, the -a option (e.g. -a 4) resulted in 100% usage of 4 cores.
  3. A "manageable" amount of time is different for different people. I'd estimate that a BLAST like this on "average" desktop hardware (say 4 reasonably quick cores and 4-8 GB RAM) would take several hours. For myself, I'd happily let it run and do something else on another server, but that doesn't suit everyone.

    If you want to look at parallel BLAST there are only a few options. One of the main ones used to be mpiBLAST, but I don't know if it works with BLAST+.

  4. BLAST is probably the best option in this case; other options are likely even slower.

I assume you're running something like:

watch -n 5 'wc -l all-vs-all.tsv'

just to keep an eye on progress and get some rough estimate of queries/second processed.
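
For a rough progress estimate, counting distinct query IDs can be more informative than counting lines, since one query may produce many hit lines. The following is only a sketch: it assumes the default -outfmt 6 column layout (query ID in the first tab-separated column), and the helper names are made up for illustration. Queries with no hits produce no lines, so the count is a lower bound.

```python
def count_queries_done(tsv_lines):
    """Count distinct query IDs (first column) in BLAST tabular output lines."""
    seen = set()
    for line in tsv_lines:
        # Skip blank lines and comment lines (comments appear with -outfmt 7).
        if line.strip() and not line.startswith("#"):
            seen.add(line.split("\t", 1)[0])
    return len(seen)


def estimate_remaining_seconds(done, total, elapsed_seconds):
    """Extrapolate remaining run time from the average rate so far.

    Returns None when no query has produced output yet.
    """
    if done == 0 or elapsed_seconds <= 0:
        return None
    rate = done / elapsed_seconds  # queries per second
    return (total - done) / rate
```

Here `total` would be the number of header lines in good_proteins.fasta, and `tsv_lines` could simply be the open all-vs-all.tsv file. The estimate assumes queries take similar time on average, which may not hold for very long or very repetitive proteins.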

13.4 years ago
Spitshine ▴ 660

The classical way to do this is to split your input data set into smaller chunks and run each chunk against the complete data set, running the chunks in parallel so that all your cores are used. If you are not interested in the alignments but only the scores, choose tabular output, which should speed things up.
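
A minimal sketch of this split-and-run approach, assuming blastp is on the PATH and the database was built with makeblastdb as in the question. The function names, the round-robin split and the chunks/ directory are illustrative choices, not anything prescribed by BLAST.

```python
import subprocess
from pathlib import Path


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists so the chunks end up similar in size."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


def blastp_command(chunk_path, db, out_path):
    """Build a single-threaded blastp call with tabular output for one chunk."""
    return ["blastp", "-db", db, "-query", chunk_path,
            "-outfmt", "6", "-out", out_path, "-num_threads", "1"]


def run_chunks(fasta_path, db, n_chunks=4, workdir="chunks"):
    """Write one FASTA file per chunk and run one blastp process per chunk."""
    Path(workdir).mkdir(exist_ok=True)
    with open(fasta_path) as handle:
        chunks = split_round_robin(read_fasta(handle), n_chunks)
    procs = []
    for i, chunk in enumerate(chunks):
        if not chunk:
            continue
        chunk_path = Path(workdir) / f"chunk_{i}.fasta"
        with open(chunk_path, "w") as out:
            for header, seq in chunk:
                out.write(f">{header}\n{seq}\n")
        out_path = Path(workdir) / f"chunk_{i}.tsv"
        cmd = blastp_command(str(chunk_path), db, str(out_path))
        procs.append(subprocess.Popen(cmd))
    # Wait for all chunk jobs; fail loudly if any blastp process failed.
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("a blastp chunk job failed")
```

With the files from the question, this would be called as run_chunks("good_proteins.fasta", "my_prot_blast_db", n_chunks=4). Because every chunk is searched against the same complete database, concatenating the per-chunk .tsv files should give the same hits as the single run, just in a different order.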


I'll probably have to move in this direction as I was indeed already using the tabular output (-outfmt 6 in the newest blast version).

11.0 years ago
jsporter ▴ 60

mpiBLAST can be used on a distributed-memory architecture.

