Question: Correct Method To Blast All-Vs-All With Ncbiblast & How To Speed It Up?
5
Tim (330), Nijmegen, the Netherlands, wrote 9.4 years ago:

Hi all,

I'm using ncbi-blast-2.2.24+ (on Ubuntu Linux) for a sizable all-vs-all BLAST of protein sequences (530,000 lines of FASTA). This is taking quite a while (over an hour already), so I'm looking into ways to speed it up.

What I've done is run:

ncbi-blast-2.2.24+/bin/makeblastdb -in good_proteins.fasta -dbtype prot -out my_prot_blast_db

followed by:

ncbi-blast-2.2.24+/bin/blastp -db my_prot_blast_db -query good_proteins.fasta -outfmt 6 -out all-vs-all.tsv -num_threads 4

Now firstly: Is this the correct way to do an all-vs-all blast?

And secondly: How can I speed this up?
I added -num_threads 4 in the hope of using all four of my processing cores, but BLAST just alternates between cores, using 100% of one CPU while the other three sit near idle. (Being a CS graduate I'm aware of the distinction between cores & threads, but I didn't see any other configuration option that seemed related: http://www.ncbi.nlm.nih.gov/books/NBK1763/)

Possibly thirdly: Is it at all reasonable to expect this all-vs-all BLAST on such a dataset to run in a manageable amount of time, or should I somehow divide this up / move to supercomputers?

(And maybe fourthly: I just chose ncbi-blast because I thought it'd be a good choice; would any other tool handle this case better?)

Best regards, Tim

blast • 21k views
Modified 7.1 years ago by jsporter • written 9.4 years ago by Tim
4
Neilfws (48k), Sydney, Australia, wrote 9.4 years ago:
  1. Yes, that looks good for all-v-all BLAST; your command-line arguments look fine.
  2. There is some discussion of the num_threads issue over at SEQanswers. A comment there suggests that only part of the BLAST+ procedure (word matching) is multi-threaded. Someone else suggests that this is an issue with BLAST+ as opposed to the "older" BLAST. That comment rings true with me; I have not used BLAST+, but I recall that with the old blastall, the -a option (e.g. -a 4) resulted in 100% usage of 4 cores.
  3. A "manageable" amount of time is different for different people. I'd estimate that a BLAST like this on "average" desktop hardware (say 4 reasonably quick cores and 4-8 GB RAM) would take several hours. For myself, I'd happily let it run and do something else on another server, but that doesn't suit everyone.

    If you want to look at parallel BLAST there are only a few options. One of the main ones used to be mpiBLAST, but I don't know if it works with BLAST+.

  4. BLAST probably is the best option in this case; other options are likely even slower.

I assume you're running something like:

watch -n 5 'wc -l all-vs-all.tsv'

just to keep an eye on progress and get some rough estimate of queries/second processed.
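One caveat with counting lines: in tabular output each line is one hit, not one query, so wc -l overstates how far the run has got. A rough sketch of a per-query progress check follows. The function name blast_progress is made up for illustration, and it assumes -outfmt 6 (query ID in column 1) with all hits for a query written contiguously, as a single blastp run does. Queries without any hit never appear in the output, so treat the first number as a lower bound.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<queries with hits so far> / <total queries>".
# Assumes tabular output (-outfmt 6), where column 1 is the query ID and
# all hits for one query sit on consecutive lines.
blast_progress() {
    local tsv=$1 fasta=$2
    local seen total
    # Collapse consecutive duplicate query IDs, then count what is left.
    seen=$(cut -f1 "$tsv" | uniq | wc -l)
    # Every FASTA record starts with a '>' header line.
    total=$(grep -c '^>' "$fasta")
    # Arithmetic expansion strips the padding some wc versions add.
    echo "$((seen)) / $((total))"
}
```

With the file names from the question, this would be called as: blast_progress all-vs-all.tsv good_proteins.fasta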

Modified 11 months ago by RamRS • written 9.4 years ago by Neilfws
2
Spitshine (640), Esch-sur-Alzette, Luxembourg, wrote 9.4 years ago:

The classical way to do this is to split your input data set into smaller chunks and run each chunk against the complete data set, one process per core, so that all your cores are in use. If you are not interested in the alignments but only in the scores, choose tabular output, which should speed things up.
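A minimal sketch of this split-and-run approach, assuming bash, awk, and blastp on the PATH. The function names split_fasta and blast_chunks and the chunk file naming are invented for illustration; the blastp flags are the ones already used in the question. Records are dealt out round-robin so the chunks end up with roughly equal numbers of sequences.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split INPUT into N chunk files, dealing FASTA records
# out round-robin. Writes PREFIX.0.fasta ... PREFIX.(N-1).fasta.
split_fasta() {
    local input=$1 n=$2 prefix=$3
    awk -v n="$n" -v prefix="$prefix" '
        # A header line starts a new record: pick the next chunk file.
        /^>/ { out = sprintf("%s.%d.fasta", prefix, count % n); count++ }
        # Write header and sequence lines to the current chunk file.
        out != "" { print > out }
    ' "$input"
}

# Hypothetical helper: run one blastp process per chunk against the full
# database, all at the same time, then concatenate the tabular results.
# Exit codes of the individual blastp runs are not checked in this sketch.
blast_chunks() {
    local db=$1 prefix=$2 merged=$3
    local chunk
    for chunk in "$prefix".*.fasta; do
        blastp -db "$db" -query "$chunk" -outfmt 6 \
               -out "${chunk%.fasta}.tsv" &
    done
    wait    # block until every background blastp has finished
    cat "$prefix".*.tsv > "$merged"
}
```

For the four-core machine in the question, the calls would be: split_fasta good_proteins.fasta 4 chunk, followed by blast_chunks my_prot_blast_db chunk all-vs-all.tsv. Because every chunk is searched against the same complete database, the merged file should contain the same hits as a single run, only in a different order.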


I'll probably have to move in this direction as I was indeed already using the tabular output (-outfmt 6 in the newest blast version).

Reply written 9.4 years ago by Tim
0
jsporter (60) wrote 7.1 years ago:

mpiBLAST can be used on a distributed-memory architecture.
