Is Blast+ Running As Fast As It Could?
4
6
Entering edit mode
11.1 years ago
Amr ▴ 160

Hi all,

I am trying to run a tblastn search of nt locally using blast+ 2.2.25, and was wondering if anyone knew whether it's going as fast as it should - I have thousands of queries to get through!

I should mention - at the moment I am BLASTing via a Python script which blasts in batches that I define. I've tried 5 sets of 20-query batches, which took an average of 20 mins to run..

What's the slowest part of the blast search? What could I do to speed it up? The ideal runtime would be 15-30 seconds per query.

I'm using 8 cores by the way, with 12 GB of RAM, and I've changed -num_threads to 16, and also excluded some GIs.. any other ideas?

Thanks to all of you that answered! It's my first post and I'm quite touched that I got this much support from strangers! Inspires me to contribute as much as I can!

#!/usr/bin/python

# BLASTing in a way that doesn't crash the webserver!

import os
import sys
import time
from Bio import SeqIO

mult = int(input('How many queries shall I BLAST each time? '))
recfile = input('\n\nI need the file name containing your genpept records: ')

########################

def numbsuffix(d):
    # ordinal suffix for an integer: 1st, 2nd, 3rd, 4th, ...
    return 'th' if 11 <= d % 100 <= 13 else {1: 'st', 2: 'nd', 3: 'rd'}.get(d % 10, 'th')

########################

def batch_full(number):
    # True when a full batch of queries has accumulated
    return number % mult == 0

########################

# create file for queries
testfile = open('blastqueries.fasta', 'w')

# parse the genpept records
all_records = SeqIO.parse(recfile, 'genbank')

# loop through the records, write FASTA, and send every N queries
# for BLASTing in user-defined batches
count = 0
which_batch = 0

for each_record in all_records:
    testfile.write('>%s\n%s\n' % (each_record.id, each_record.seq))
    count += 1
    if batch_full(count):
        testfile.close()
        which_batch += 1
        if not os.path.exists('Blast_Results/blast_out_%s.xml' % count):
            print('\nReached %s%s multiple of %s.\n\nRunning BLAST, please wait...'
                  % (which_batch, numbsuffix(which_batch), mult))
            start = time.time()
            status = os.system('bash Remoteblast.sh blastqueries.fasta '
                               'Blast_Results/blast_out_%s.xml' % count)
            if status != 0:
                sys.exit('BLAST run failed on batch %s' % which_batch)
            timediff = time.strftime('%H:%M:%S', time.gmtime(time.time() - start))
            print('Just BLASTed, it took this much time: %s' % timediff)
        testfile = open('blastqueries.fasta', 'w')

# note: a final partial batch (when count is not a multiple of mult)
# is left in blastqueries.fasta without being BLASTed
testfile.close()

blast • memory • 7.6k views
1
Entering edit mode

On second thought - maybe I could split single-core blast searches onto all the cores using your forking-in-Python idea. Manu Prestat's comment below, if I understand it correctly, seems to suggest this..

0
Entering edit mode

sudo renice -20 tblastn_pid to increase process priority to high!?

0
Entering edit mode

See http://docs.python.org/library/subprocess.html for Python parallel processing library too!

0
Entering edit mode

See http://docs.python.org/library/subprocess.html for forking parallel processes in Python or here for more specific parallel modules http://wiki.python.org/moin/ParallelProcessing
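To make the forking idea concrete, here is a minimal sketch of running several single-threaded tblastn jobs in parallel with subprocess; the batch file names and the `nt` database path below are placeholder assumptions, not taken from the thread.

```python
# Sketch: fork several single-threaded tblastn jobs with subprocess.
# The batch file names and the 'nt' database path are placeholder assumptions.
import subprocess

def run_parallel(commands, max_procs):
    """Run argv lists concurrently, keeping at most max_procs processes alive."""
    running, returncodes = [], []
    for argv in commands:
        running.append(subprocess.Popen(argv))
        if len(running) >= max_procs:
            returncodes.append(running.pop(0).wait())  # block on the oldest job
    returncodes.extend(p.wait() for p in running)      # drain the rest
    return returncodes
```

Usage would then look like `run_parallel([['tblastn', '-query', 'batch1.fasta', '-db', 'nt', '-num_threads', '1', '-out', 'batch1.xml'], ...], max_procs=8)`, one single-threaded job per batch file.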

0
Entering edit mode

Will definitely try the priority-boost thing, although the blasting is happening via os.system(), so Python parallel processes shouldn't be relevant, right?

0
Entering edit mode

You need to do a "ps aux | grep tblastn" first to get the pid and then "sudo renice -20 pid" ;)

0
Entering edit mode

Having a framework setup to run applications in parallel is always useful too :)

0
Entering edit mode

I really got interested in the script! Could you share it?

0
Entering edit mode

Haha, yes sure, but a word of warning! This is my first script EVER, so it might be buggy and inefficient!

0
Entering edit mode

Thanks, I'm sure it will be useful for a lot of people :) And as time passes I'm sure the community will make it robust and efficient!

0
Entering edit mode

Good point, what do you think of it?

0
Entering edit mode

I am trying to make BLAST fast with the following script... it is showing the following error:

Can't exec "makeblastdb": No such file or directory at blast.pl line 42, <GEN0> line 42132.

################################################################

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
use Parallel::ForkManager;
use Time::Local;
#############################
# USAGE: perl script.pl blastmethod directory_of_fasta_files evalue outfmt outdir number_of_cpus
# e.g. perl blastplus_parallel.pl blastn SampleDIR e-10 6 outdir 10
#############################
# author: Ulrike Loeber (uloeber@uni-potsdam.de)
# This script works around the imperfect parallelization of blast+. It splits the
# input files into smaller jobs run on more than one CPU to improve performance.

my $blast_method = $ARGV[0];
my $seq_dir      = $ARGV[1];
my $evalue       = $ARGV[2];
my $outfmt       = $ARGV[3];
my $outdir       = $ARGV[4];
my $cpus         = $ARGV[5];

opendir(INDIR, $seq_dir) or die $!;
my @files = grep /\.fasta$/, readdir(INDIR); # every file in the directory ending in ".fasta"
closedir INDIR;

my $numberOfProcesses = $cpus - 1; # one CPU is kept for Perl itself
my $subsets = $numberOfProcesses;  # build as many subsets as free CPUs
my $manager = Parallel::ForkManager->new($numberOfProcesses);

foreach my $file (@files) {
    my $time = localtime();
    print "#########PROCESSING##########\n$file\t$time\n";
    my $input = Bio::SeqIO->new(-file => "$seq_dir/$file", -format => 'fasta');
    my @seq_array;
    while (my $seq = $input->next_seq()) {
        push @seq_array, $seq;
    }
    my $numberofsequences = @seq_array;
    system "makeblastdb -dbtype nucl -in $seq_dir/$file";
    my $loops = $numberofsequences / $subsets; # sequences per subset
    for (my $j = 0; $j < $subsets; $j++) {
        # create one subset file per free CPU, named like the infile with subset_ in front
        open(OUTFILE, ">$seq_dir/subset_$j\_$file") or die $!;
        for (my $i = $j * $loops; $i <= (($j + 1) * $loops) - 1; $i++) { # 1/x of the sequences
            my $seq = $seq_array[$i];
            my $id = $seq->id();
            my $sequence = $seq->seq();
            print OUTFILE ">$id\n$sequence\n";
        }
        close OUTFILE;
        $manager->start and next;
        system "$blast_method -query $seq_dir/subset_$j\_$file -db $seq_dir/$file "
             . "-evalue $evalue -outfmt $outfmt -out $outdir/subset_$j\_$file.blast";
        $manager->finish;
    }
    print "#########END##########\n$file\t$time\n";
}

# cleaning up the directory
$manager->wait_all_children;
foreach my $file (@files) {
    my $outfile = $file;
    $outfile =~ s/fasta/blast/g;
    system "touch $outdir/$outfile"; # one outfile per fasta file
    for (my $j = 0; $j < $subsets; $j++) {
        system "cat $outdir/subset_$j\_$file.blast >> $outdir/$outfile"; # concatenate subset results
        system "rm $outdir/subset_$j\_$file.blast"; # remove subset blast results
        system "rm $seq_dir/subset_$j\_$file";      # remove subset data files
    }
    system "rm $seq_dir/$file.nhr";
    system "rm $seq_dir/$file.nin";
    system "rm $seq_dir/$file.nsq";
}

0
Entering edit mode

It simply seems that the blast tool "makeblastdb" is not installed, or your shell/Perl does not know where to find it. Can you run "makeblastdb" from your command line manually? If not, solve that problem first by adding the BLAST tools to your PATH as explained in the BLAST documentation.
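As a quick sanity check (a sketch, not part of the original script), you can also test from Python whether the BLAST+ binaries are visible on PATH before launching anything:

```python
# Sketch: report which BLAST+ binaries are missing from PATH.
import shutil

def missing_blast_tools(tools=('makeblastdb', 'tblastn')):
    """Return the names of the given executables not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]
```

If `makeblastdb` shows up in the returned list, extend PATH (or install BLAST+) before rerunning the Perl script.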

0
Entering edit mode

Now, the errors are:

Warning: [tblastn] Query is Empty!
cat: /home/sekhwalm/Oryza/blast//subset_0_dna_rm.fasta.blast: No such file or directory
rm: cannot remove '/home/sekhwalm/Oryza/blast//subset_0_dna_rm.fasta.blast': No such file or directory

0
Entering edit mode

Hm, it seems that the pre-processing is not done correctly and the sub-files are not created correctly, i.e., they are empty, but given your information I cannot verify why yet.

1. Question: Can you give me the exact command line and arguments with which you execute the script?

2. Question: Did you notice that the script searches for input files with the suffix ".fasta" in the given directory? Do your input files match that requirement?

3. Question (as I just had a look into the code): did you realize that the script matches the input queries against themselves and not against an external BLAST database? Is this really what you want?

10
Entering edit mode
11.1 years ago
Yannick Wurm ★ 2.4k

To increase BLAST speed, try increasing specificity:

• Can you make the evalue cutoff more strict? (-evalue 1.0e-10)
• How many results do you need? (perhaps one is enough?) -max_target_seqs 1
• Alignment calculation is super slow. Do you need to output (and calculate) the alignments? (if not, use the tabular output format). Or how many alignments are enough (1, 5, 10? Probably not all 250)? -num_alignments 1
• A bigger -word_size makes things faster.
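The suggestions above can be combined into one invocation; here is a hedged sketch that assembles such a tblastn command line in Python (the file names are placeholders, and how large a word_size is safe depends on how distant your homologs are):

```python
# Sketch: assemble a speed-tuned tblastn command from the flags suggested above.
# 'queries.fasta' and 'nt' are placeholder names.
def speed_tuned_blast_cmd(query, db):
    return ['tblastn',
            '-query', query,
            '-db', db,
            '-evalue', '1.0e-10',     # stricter cutoff than the default of 10
            '-max_target_seqs', '1',  # keep only the best hit per query
            '-outfmt', '6',           # tabular output: no alignment text to render
            '-word_size', '6']        # larger than the protein default of 3

cmd = speed_tuned_blast_cmd('queries.fasta', 'nt')
```

The list form can be passed straight to subprocess without shell quoting worries.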
2
Entering edit mode

Yes - BLAST calculates something approximate during the search (to see the order of magnitude of the score). Subsequently, if you ask for output, it calculates a "high-quality" alignment. Some info is in Fig 1 and the text below it in the BLAST+ publication (http://goo.gl/9UBhE).

0
Entering edit mode

I need all results, unfortunately. Also, I thought that evalue wouldn't change the runtime because it will still perform the search, right? Ah, unless that will save me alignment time..

As for num_alignments, I could always leave the alignments for later, after I've parsed my results and thrown away what I don't need..

0
Entering edit mode

Are you sure that -evalue, -max_target_seqs and -num_alignments are so important? I thought they only affected writing to the output file, not the alignment itself, so the gain in time would be tiny... I used to not restrict the output so much and to filter it afterwards if needed.

0
Entering edit mode

That's what I thought too, Manu, but maybe the alignment happens after the search rather than at the same time.. so by restricting this, you save time?

5
Entering edit mode
11.1 years ago
ALchEmiXt ★ 1.9k

+1 for RAM; that is what I would suggest at first.

The 6-frame translation of all sequences takes significant time, so why use tblastn? It needs to translate the nt DB before searching it with protein queries. It would speed things up if you could do a blastn (nt to nt) or, if you are not stuck on redundant sequences, a blastp of your queries against NR.

0
Entering edit mode

blastp versus nr is a good point

0
Entering edit mode

I need to check all frames because I'm looking for very distant homologs, most of which are probably in noncoding regions. I should say, nt is just for practice; eventually I'm going to check all the databases.

So does it keep the 6 frames in memory!? Perhaps I could pre-translate the whole database into the 6 frames and search with blastp..
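Pre-translating is essentially what tblastn does on the fly; a minimal pure-Python sketch of six-frame translation (standard codon table, with 'X' for codons containing ambiguity codes) looks like this:

```python
# Sketch: six-frame translation using the standard codon table.
# Codons with ambiguity codes are rendered as 'X'.
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans('ACGT', 'TGCA'))[::-1]

def six_frame_translations(seq):
    """Translate seq in all six reading frames (3 forward, 3 reverse)."""
    seq = seq.upper()
    frames = []
    for strand in (seq, revcomp(seq)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames.append(''.join(CODON_TABLE.get(c, 'X') for c in codons))
    return frames
```

For real data you would more likely use Biopython's Seq.translate(), but this shows why the work (and memory) is six times that of a plain nucleotide scan.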

0
Entering edit mode

Hmm, still don't get it: searching for distant protein sequences in non-coding regions... you mean like in pseudogenes? Otherwise protein and non-coding don't fit :-)

0
Entering edit mode

Yeah, basically - pseudogenes are a good proxy for what I'm doing.

1
Entering edit mode
11.1 years ago

According to your last comment, you could also try to restrict your database to non-coding regions, after filtering nt.

0
Entering edit mode

Yeah, but I still need to check coding as well..

0
Entering edit mode

so why not tblastn vs non coding nt, and blastp vs nr?

0
Entering edit mode

(ok, the e-values would not be comparable I agree ;-) )

0
Entering edit mode

That's still a cool idea!

0
Entering edit mode

What do you think about pre-translating nt and doing a blastp? The db would be huge...

0
Entering edit mode

Well, it wouldn't actually be different from doing a tblastn.

0
Entering edit mode
11.1 years ago

First of all, I would advise executing 16 one-core BLASTs at a time instead of using 16 cores for each query (if you work on one node only, you won't need to deal with MPI libraries). Second, if you want to take advantage of the 16 cores, you should install at least triple the RAM. Lastly, BLAST speed is very sensitive to the word_size argument.

0
Entering edit mode

Sorry, made a typo - I have 8 cores but am running with num_threads 16 because I was told that the computer can create 'virtual cores'.

Also, I'm really new to computing and I'm not sure I understand your suggestion:

"16 times one-core-blast at a time instead of using 16 cores (if you work on one node only you won't need to deal with MPI libraries)"

Do you mean run with num_threads at 1, and run blast 8 times simultaneously?

I might change word_size a bit, but I don't want to lose too many results.. maybe go up to 4 or 5? Or is that excessive?

0
Entering edit mode

Yes, you understood my point.

gt splitfasta -numfiles 8 seqs.fasta
tblastn -query seqs.fasta.1 -db nt ... &
tblastn -query seqs.fasta.[...] -db nt ... &
tblastn -query seqs.fasta.8 -db nt ... &

If a tab output, cat *.blast > myresults.blast
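If GenomeTools (`gt splitfasta`) is not installed, the splitting step can be sketched in a few lines of Python; the round-robin distribution and the `.1 ... .N` suffix convention here are assumptions that loosely mimic `gt splitfasta`:

```python
# Sketch: split a FASTA file into n parts, round-robin by record.
# Produces path.1 ... path.n, loosely mimicking `gt splitfasta`.
def split_fasta(path, n):
    parts = ['%s.%d' % (path, i + 1) for i in range(n)]
    outs = [open(p, 'w') for p in parts]
    idx = -1
    with open(path) as fh:
        for line in fh:
            if line.startswith('>'):
                idx = (idx + 1) % n  # each new record goes to the next part
            outs[idx].write(line)
    for out in outs:
        out.close()
    return parts
```

Each part can then be fed to its own single-threaded tblastn as shown above.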

0
Entering edit mode

4 or 5 is not excessive if it helps you get what you need in an acceptable time. I would remind you that the default behavior of blastn, for instance, is a megablast task with word_size=28, which is a lot more than a 7 in a protein search...