Script To Submit And Manage Blast Jobs On A Cluster?
5
4
Entering edit mode
11.0 years ago
Shellfishgene ▴ 310

Hi,

I use a local cluster (managed through LSF) to BLAST large sequence sets. I wrote a basic script that splits a fasta file and submits each part as a separate BLAST job. Does anyone know of a more advanced script of this kind, one that also does things such as merge the output or restart failed jobs?

I could extend mine to do that, but I have this feeling that I would be reinventing the wheel.

And by the way, we do have mpiBLAST installed, but I don't think it's actually any faster than splitting the input file.
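For context, here is a minimal sketch of the split-and-submit approach described above. The file names, chunk size, database (`nr`), and job names are all illustrative, and the `bsub` commands are only echoed (a dry run) rather than submitted:

```shell
# Toy stand-in for a real multi-FASTA query so the sketch runs anywhere.
printf '>s1\nMKV\n>s2\nMAL\n>s3\nMRT\n>s4\nMQQ\n' > query.fa

# Split into chunks of 2 sequences each (use e.g. 500 for real data).
awk -v size=2 '/^>/ { if (n % size == 0) f = sprintf("chunk_%02d.fa", n / size); n++ }
               { print > f }' query.fa

# Dry run: print the bsub command for each chunk instead of submitting it.
for c in chunk_*.fa; do
    echo bsub -J "blast_$c" -o "$c.lsf.log" \
        blastp -query "$c" -db nr -outfmt 6 -out "$c.tsv"
done
```

Dropping the `echo` would actually submit one LSF job per chunk.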

blast clustering • 4.8k views
2
Entering edit mode
11.0 years ago

I'm not sure I would merge the results; rather, parse each XML file separately, which is much faster. See this for XML parsing using XMLStarlet XSLT:

If you must merge, you can simply use `cat`.
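A small sketch of that merge step, plus a check for failed jobs (the question asked about restarts). File names are illustrative, and the per-chunk results are toy stand-ins; note that plain concatenation is only safe for tabular output, not XML:

```shell
# Toy per-chunk result files standing in for real tabular BLAST output.
printf 'q1\tsubj1\t98.0\n' > chunk_00.tsv
printf 'q2\tsubj2\t91.5\n' > chunk_01.tsv
: > chunk_02.tsv    # empty file, mimicking a failed job

# Tabular results can simply be concatenated; XML cannot.
cat chunk_*.tsv > all_hits.tsv

# List chunks whose output is missing or empty, so only those get resubmitted.
for t in chunk_*.tsv; do
    [ -s "$t" ] || echo "resubmit: ${t%.tsv}"
done
```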

1
Entering edit mode
11.0 years ago

Paracel does what you want, but it is (really) not free. I wrote a Python script similar to yours that divides the input (I agree with you on this point) and uses MPI, which is necessary for dealing with several nodes. I can share it if you need.

0
Entering edit mode

Paracel seems like the thing I was looking for, but I doubt we'll want to pay for it. It would be great if you could share your script. Mine is really simple; it just splits the input and submits the jobs to LSF. I'm not sure how yours involves MPI?

1
Entering edit mode
11.0 years ago
Ahdf-Lell-Kocks ★ 1.6k

An option is to use eHive, which is free and open source:
http://www.biomedcentral.com/1471-2105/11/240

Job processing can range from a very simple list of commands to very complex pipelining, like the pipelines used in Ensembl and other projects out there. A simple example of piping command lines into a queueing system, with fault tolerance and resource management (number of CPUs, memory, etc.), all in one script, is here:

ensembl-hive/scripts/cmd_hive.pl


Also have a look at InputFile_SystemCmd:

init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::InputFile_SystemCmd_conf -ensembl_cvs_root_dir $HOME$dbdetails -inputfile very_long_list_of_blast_jobs.txt
beekeeper.pl -url \$dburl -loop


There are a few Perl dependencies to get it working, and then the backend can be a no-frills SQLite database, which works fine for tens to a few hundred concurrent jobs, or a MySQL backend, which usually works well for hundreds up to close to a thousand concurrent jobs.

LSF support comes out of the box in eHive. There is also support for some other queueing systems, like SGE. The same script that you run on your farm can first be tested on your workstation without any queueing system at all, just by using the '-local' option.

0
Entering edit mode
11.0 years ago
Gorysko ▴ 100

If I'm correct, blast2 has the "-a" flag, where you can indicate how many processors you want to use.

2
Entering edit mode

You're correct, but the speed improvement is far from linear in the number of CPUs/cores. Partitioning the data and launching separate jobs is much more efficient.


0
Entering edit mode

Also, if I used -num_threads 8, I'd have to request a full node on the cluster for each job, and those seem to spend more time waiting in the queue.

0
Entering edit mode

I am not too familiar with LSF, but in some batch-job management systems, jobs requiring more CPU cores get lower priority, so dividing the query into many instances without using the multithreading flag is faster.

0
Entering edit mode
11.0 years ago
Yannick Wurm ★ 2.4k

Do you absolutely need XML output? For XML output the local alignments are necessarily calculated (this was the case in legacy BLAST; I don't know whether this has changed with BLAST+), and calculating them is slow.

So you may be able to accelerate things dramatically by using tabular output. See also A: Is Blast+ Running As Fast As It Could?
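For reference, this is the flag difference in BLAST+ (query file, database name, and output paths are placeholders; these command fragments assume a local BLAST+ installation and database):

```shell
# XML output (-outfmt 5): includes full alignment text, slower to produce/parse.
blastp -query query.fa -db nr -outfmt 5 -out hits.xml

# Tabular output (-outfmt 6): much leaner, and trivial to concatenate later.
blastp -query query.fa -db nr -outfmt 6 -out hits.tsv
```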

0
Entering edit mode

I think local alignments are computed anyway; if not, how would the scoring function work?

0
Entering edit mode

I think it's approximated rather than optimized; see Fig. 1 in http://goo.gl/9UBhE