Script To Submit And Manage Blast Jobs On A Cluster?
5
4
11.0 years ago
Shellfishgene ▴ 310

Hi,

I use a local cluster (managed through LSF) to BLAST large sequence sets. I wrote a basic script that splits a FASTA file and submits each part as a separate BLAST job. Does anyone know of a more advanced script that also does things such as merging the output or restarting failed jobs?

I could extend mine to do that, but I have this feeling that I would be reinventing the wheel.

And btw we have mpiBLAST installed, but I think it's not actually any faster than splitting the input file.
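For readers looking for a starting point, the split-and-submit approach described above might look roughly like this Python sketch. It is not the poster's actual script: the function names, the job name blast_chunk, the choice of blastn, and the log-file naming are all assumptions made for illustration. It only builds the bsub command lines and does not submit anything.

```python
from typing import List


def read_fasta_records(text: str) -> List[str]:
    """Return each FASTA record (header plus sequence lines) as one string."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current:
                records.append("\n".join(current) + "\n")
            current = [line]
        elif line.strip() and current:
            # Non-empty sequence line belonging to the current record.
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def split_records(records: List[str], n_chunks: int) -> List[List[str]]:
    """Deal records round-robin into at most n_chunks non-empty chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(min(n_chunks, len(records)))]
    for i, record in enumerate(records):
        chunks[i % len(chunks)].append(record)
    return chunks


def bsub_command(chunk_path: str, db: str, out_path: str) -> List[str]:
    """Build (but do not run) an LSF submission for one chunk.

    Assumes BLAST+ blastn; the job name and log path are hypothetical.
    """
    return [
        "bsub", "-J", "blast_chunk", "-o", out_path + ".log",
        "blastn", "-query", chunk_path, "-db", db, "-out", out_path,
    ]
```

Each chunk would then be written to its own file and the matching command passed to something like subprocess.run. Round-robin dealing is used here so that chunks end up with similar numbers of sequences, although it does not balance total sequence length.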

blast clustering • 4.8k views
2
11.0 years ago

I'm not sure I would merge the results; I would rather parse each XML file separately, which is much faster. See this for XML parsing using XMLStarlet and XSLT:

If you must merge, you can simply use cat.
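If the merge has to happen inside a driver script rather than on the command line, the equivalent of cat can be sketched in Python as below. The function name and the file names in the usage note are hypothetical. One caveat worth stating: plain concatenation suits line-oriented output such as tabular results, whereas several concatenated BLAST XML files do not form one well-formed XML document, because each file carries its own root element.

```python
import shutil
from typing import Iterable


def merge_outputs(part_paths: Iterable[str], merged_path: str) -> None:
    """Concatenate per-chunk result files into merged_path, in the given order."""
    with open(merged_path, "wb") as merged:
        for path in part_paths:
            with open(path, "rb") as part:
                # Stream in blocks so large result files are never fully in memory.
                shutil.copyfileobj(part, merged)
```

For example, merge_outputs(["chunk_0.out", "chunk_1.out"], "all.out") would produce the same bytes as cat chunk_0.out chunk_1.out > all.out.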

1
11.0 years ago

Paracel does what you want, but it is (really) not free. I wrote a Python script similar to yours that divides the input (I agree with you on this point) and uses MPI, which is necessary for dealing with several nodes. I can share it if you need it.

0

Paracel seems like the thing I was looking for, but I doubt we'll want to pay for it. It would be great if you could share your script. Mine is really simple: it just divides the input and submits jobs to LSF. I'm not sure how yours involves MPI?

0

OK, contact me and I will help you.

1
11.0 years ago
Ahdf-Lell-Kocks ★ 1.6k

An option is to use eHive, which is free and open source:
http://www.biomedcentral.com/1471-2105/11/240

Job processing can range from a very simple list of commands to very complex pipelines, like the ones used in Ensembl and other projects. A simple example of piping command lines into a queueing system, with fault tolerance and resource management (number of CPUs, memory, etc.), all in one script, is here:

ensembl-hive/scripts/cmd_hive.pl

Also have a look at InputFile_SystemCmd:

init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::InputFile_SystemCmd_conf -ensembl_cvs_root_dir $HOME $dbdetails -inputfile very_long_list_of_blast_jobs.txt
beekeeper.pl -url $dburl -loop

There are a few Perl dependencies to get it working. The backend can then be a no-frills SQLite database, which works fine for tens to a few hundred concurrent jobs, or a MySQL backend, which usually works well for hundreds to close to a thousand concurrent jobs.

LSF support comes out of the box in eHive. There is also support for some other queueing systems, such as SGE. You can first test the same script you use on your farm on your workstation, without a queueing system, just by using the '-local' option.

0
11.0 years ago
Gorysko ▴ 100

If I'm correct, blast2 has an option "-a" where you can indicate how many processors you want to use.

2

You're correct, but the speed-up is far from linear in the number of CPUs/cores. Partitioning the data and launching separate jobs is much more efficient.

0

Also, if I used -num_threads 8 I'd have to get a full node on the cluster for each job, and it seems such jobs then spend more time in the queue.

0

I am not too familiar with LSF, but in some batch-job management systems, jobs requiring more CPU cores get lower priority, so dividing the query into many instances without using the multithreading flag is faster.

0
11.0 years ago
Yannick Wurm ★ 2.4k

Do you absolutely need XML output? With XML output the local alignments are necessarily calculated (this was the case in legacy BLAST; I don't know whether it has changed with BLAST+), and calculating them is slow.

So you may be able to dramatically accelerate things by using tabular output. See also A: Is Blast+ Running As Fast As It Could ?
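Tabular output is also easy to post-process. The Python sketch below assumes the default 12-column tabular layout (BLAST+ -outfmt 6, or -m 8 in legacy blastall: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps the highest-bitscore hit per query. The Hit type and the best_hits function are hypothetical names, and a custom column selection would need different field indices.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str]) -> Dict[str, Hit]:
    """Return the highest-bitscore hit per query from default tabular output."""
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the '#' comment lines written by -outfmt 7.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError("expected 12 tab-separated columns: " + line)
        hit = Hit(fields[0], fields[1], float(fields[2]),
                  float(fields[10]), float(fields[11]))
        # Keep the first hit seen for a query unless a later one scores higher.
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best
```

Because the function takes any iterable of lines, it can be applied to each per-chunk result file in turn or to an open handle on a merged file.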

0

I think local alignments are computed anyway; if not, how does the scoring function work?

0

I think it's approximated rather than optimized - see fig 1 http://goo.gl/9UBhE
