Hi all,
I am trying to run a local tblastn search against nt using BLAST+ 2.2.25, and was wondering if anyone knew whether it's running as fast as it should. I have thousands of queries to get through!
I should mention that at the moment I am BLASTing via a Python script, which submits the queries in batches of a size that I define. I've tried 5 batches of 20 queries each, which took an average of 20 minutes each to run.
What's the slowest part of a BLAST search, and what could I do to speed it up? The ideal runtime would be 15-30 seconds per query.
I'm using 8 cores with 12 GB of RAM, by the way. I've also set -num_threads to 16 and excluded some GIs. Any other ideas?
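For reference, here is a minimal Python sketch of how the tblastn command line for one batch could be assembled with the thread count capped at the number of available cores. The function name build_tblastn_cmd and the file names are hypothetical, and the cap is an assumption on my part: asking for more threads than cores (16 on an 8-core machine) usually does not buy extra speed. Nothing is executed here; the function only returns the argument list.

```python
import os


def build_tblastn_cmd(query, db, out, threads=None):
    """Return the argument list for one tblastn run (nothing is executed)."""
    cores = os.cpu_count() or 1
    # Assumption: never request more threads than there are cores.
    n = cores if threads is None else max(1, min(threads, cores))
    return [
        "tblastn",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "5",           # XML output
        "-num_threads", str(n),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_tblastn_cmd("blastqueries.fasta", "nt", "blast_out.xml", threads=16)
    print(" ".join(cmd))
```

The returned list can be passed directly to subprocess.run() or subprocess.Popen().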
Thanks to all of you who answered! It's my first post and I'm quite touched that I got this much support from strangers. It inspires me to contribute as much as I can!
#!/usr/bin/python
# BLASTing in a way that doesn't crash the webserver!
# (written for Python 3: input() returns a string and print is a function)
import os
import sys
import time

from Bio import SeqIO

mult = int(input('How many queries shall I BLAST each time? '))
recfile = input('\n\nI need the file name containing your GenPept records: ').strip().strip('\'"')

########################
def numbsuffix(d):
    # ordinal suffix for a number: 1st, 2nd, 3rd, 4th, 11th, ...
    return 'th' if 11 <= d % 100 <= 13 else {1: 'st', 2: 'nd', 3: 'rd'}.get(d % 10, 'th')

########################
def tentest(number):
    # True when a full batch of `mult` queries has been written
    return number % mult == 0

########################
# create file for queries
testfile = open('blastqueries.fasta', 'w')
# parse the GenPept file (SeqIO.parse yields one record at a time)
all_records = SeqIO.parse(recfile, 'genbank')
# loop through the GenPept records, write them as FASTA, and BLAST every `mult` records
count = 0
which_batch = 0
for each_record in all_records:
    ids = each_record.id
    sequences = each_record.seq
    testfile.write('>%s\n%s\n' % (ids, sequences))
    count += 1
    if tentest(count):
        testfile.close()
        which_batch += 1
        outfile = 'Blast_Results/blast_out_%s.xml' % count
        # skip batches that already have a result file
        if not os.path.exists(outfile):
            print('\nReached %s%s multiple of %s.\n\nRunning BLAST, please wait...' % (which_batch, numbsuffix(which_batch), mult))
            start = time.time()
            # os.system does not raise on failure, so check the exit status instead
            status = os.system('bash Remoteblast.sh blastqueries.fasta %s' % outfile)
            if status != 0:
                sys.exit('BLAST failed for batch %s' % which_batch)
            secs = int(time.time() - start)
            timediff = time.strftime('%H:%M:%S', time.gmtime(secs))
            print('Just BLASTed, it took this much time:', timediff)
        # start a fresh query file for the next batch
        testfile = open('blastqueries.fasta', 'w')
# note: a final partial batch (fewer than `mult` records) is written but not BLASTed
testfile.close()
On second thought, maybe I could split single-core BLAST searches across all cores using your idea of forking in Python. Manu Prestat's comment below, if I understand it correctly, seems to suggest this.
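A minimal sketch of the splitting step, assuming the queries are already in one FASTA file: the records are dealt out round-robin into one chunk per core, and each chunk could then be written to its own file and searched by a separate single-threaded BLAST process. The function name split_fasta is hypothetical and not part of any library.

```python
def split_fasta(fasta_text, n_chunks):
    """Split FASTA text into at most n_chunks pieces, dealing records round-robin."""
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line])        # a header line starts a new record
        elif records and line.strip():
            records[-1].append(line)      # sequence line of the current record
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append("\n".join(rec) + "\n")
    # Drop empty chunks when there are fewer records than chunks.
    return ["".join(c) for c in chunks if c]


if __name__ == "__main__":
    demo = ">a\nMK\n>b\nLL\nVV\n>c\nAA\n"
    for n, chunk in enumerate(split_fasta(demo, 2)):
        print("chunk", n)
        print(chunk, end="")
```

Round-robin keeps the chunks close in record count; if query lengths vary a lot, balancing by total sequence length might spread the work more evenly.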
How about "sudo renice -20 tblastn_pid" to raise the process priority to high?
See http://docs.python.org/library/subprocess.html for forking parallel processes in Python, or http://wiki.python.org/moin/ParallelProcessing for more specific parallel modules.
Will definitely try the high-priority thing, although the BLASTing is happening via os.system(), so Python parallel processes shouldn't be relevant, right?
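On the os.system() point: os.system() blocks until the command finishes, so the script only ever has one BLAST job running. subprocess.Popen() returns immediately, which lets Python start several external commands and then wait for all of them. A minimal sketch follows; run_concurrently is a hypothetical helper, and the demo uses short Python one-liners as stand-ins for the real BLAST commands.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, wait for all, return exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]   # returns immediately
    return [p.wait() for p in procs]                      # block until each ends


if __name__ == "__main__":
    # Stand-in commands: each just sleeps briefly. Replace with real BLAST
    # argument lists, one per query chunk.
    demo = [[sys.executable, "-c", "import time; time.sleep(0.2)"] for _ in range(4)]
    print(run_concurrently(demo))
```

With one job per core, each BLAST process would typically be run single-threaded so the jobs do not compete for the same cores.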
You need to do a "ps aux | grep tblastn" first to get the pid and then "sudo renice -20 pid" ;)
Having a framework setup to run applications in parallel is always useful too :)
I really got interested in the script! Could you share it?
Haha, yes, sure, but a word of warning: this is my first script EVER, so it might be buggy and inefficient!
Thanks, I'm sure it will be useful for a lot of people :) And as time passes I'm sure the community will make it robust and efficient!
Good point, what do you think of it?
I am trying to speed up BLAST with the following script... it is showing the following error
It simply seems that the BLAST tool "makeblastdb" is not installed, or your shell/Perl does not know where to find it. Can you run "makeblastdb" manually from your command line? If not, solve that problem first by adding the BLAST tools to your PATH as explained in the BLAST documentation.
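The same PATH check can be done programmatically. Here is a minimal Python sketch using shutil.which from the standard library; find_tool is a hypothetical wrapper name, and the list of tools checked is only an example.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    for tool in ("makeblastdb", "tblastn"):
        path = find_tool(tool)
        print(tool, "->", path if path else "NOT FOUND on PATH")
```

If a tool is reported as not found, the directory containing the BLAST+ binaries still needs to be added to PATH.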
Now the errors are...
Hm, it seems that the pre-processing is not done correctly and the sub-files are not created correctly, i.e., they are empty, but from the information you have given I cannot yet verify why.
1. Question: Can you give me the exact command line and arguments with which you execute the script?
2. Question: Did you notice that the script searches for input files with the suffix ".fasta" in the given directory? Do your input files match that requirement?
3. Question (as I just had a look into the code): Did you realize that the script matches the input queries against themselves and not against an external BLAST database? Is this really what you want?
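To check the second question quickly, a small Python sketch like the following lists the ".fasta" files in a directory and flags the empty ones. The function name check_fasta_inputs is hypothetical, and the script only inspects file names and sizes; it does not validate the FASTA content.

```python
import glob
import os
import sys


def check_fasta_inputs(directory):
    """Return (non_empty, empty) lists of *.fasta paths found in directory."""
    paths = sorted(glob.glob(os.path.join(directory, "*.fasta")))
    empty = [p for p in paths if os.path.getsize(p) == 0]
    non_empty = [p for p in paths if os.path.getsize(p) > 0]
    return non_empty, empty


if __name__ == "__main__":
    # Directory to inspect: first command-line argument, else the current one.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    good, bad = check_fasta_inputs(target)
    print("usable .fasta files:", len(good))
    for p in bad:
        print("EMPTY:", p)
```

If no files are reported at all, the inputs probably carry a different suffix (for example ".fa" or ".txt") and would be skipped by the script.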