Question: Trying to use GNU parallel and BLAST
pablo61991 wrote, 10 months ago:

Dear Biostar community,

I am having problems running a blastx search against the nr database. To optimize the process, I read some posts that mention GNU parallel as a way to make better use of multiple CPUs with ncbi-blast (local mode). Based on several posts in this forum, I adapted my code until I stopped receiving error messages. However, I think the program does not understand me when I try to modify the default outfmt 6.

Here is the chunk of code:

module load gcc
module load ncbi-blast
module load parallel

cat MyTranscriptome.fasta | parallel -q -j 24 --block 100M --recstart '>' --pipe blastx -db /blastdb/nr -num_threads 1 -evalue 1e-5 -outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 10 -max_hsps 1 > MyTranscriptome_vs_nr.outfmt.6

Before coming here to ask, I searched for posts related to this topic in this forum (and others), but I did not find how to fix the problem.

JUST to clarify (edit): the code ran without problems in the classic way (setting -num_threads to my maximum number of cores), so the base command works fine. My problem is the GNU parallel implementation.

Thank you for your time!

modified 10 months ago • written 10 months ago by pablo61991

I just tried it with some made-up test data and it works fine for me, I get the output that I expected - what is the error you're seeing?

written 10 months ago by Philipp Bayer

Oh, shame on me... I didn't try it on a subset of the data, and maybe it only shows output when the blast has finished. I mean, I ran this command for 12 h to try to calculate the % of the transcriptome already blasted and estimate how much time I need. However, my output was an empty file. That put me into alert mode.

If it runs properly on a test dataset, the command is fine.

Thank you for your time, and sorry that I panicked before running a test file.

written 10 months ago by pablo61991
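
The empty file after 12 h is consistent with GNU parallel's default output grouping: the output of each job is held back and only printed once that job has finished, and with --block 100M a single job is a very large piece of the transcriptome. A minimal sketch for a quick sanity check on a small subset follows. The helper name fasta_head and the subset file names are hypothetical, the smaller block size in the commented example is an assumption chosen for illustration, and the blastx call is left as a comment because it depends on the local installation.

```shell
# fasta_head N FILE: print the first N records of a FASTA file.
# (hypothetical helper, not part of BLAST or GNU parallel)
fasta_head() {
    awk -v n="$1" '
        /^>/     { seen++ }   # a header line starts a new record
        seen > n { exit }     # stop as soon as record N+1 begins
        seen >= 1 { print }   # print headers and sequence lines
    ' "$2"
}

# Example use (file names are placeholders):
#   fasta_head 100 MyTranscriptome.fasta > subset.fasta
#   cat subset.fasta | parallel -q -j 24 --block 100k --recstart '>' --pipe \
#       blastx -db /blastdb/nr -num_threads 1 -evalue 1e-5 \
#       -outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
#       -max_target_seqs 10 -max_hsps 1 > subset_vs_nr.outfmt.6
```

If the subset produces the expected columns, the long run should too; it just will not write anything until its first job completes.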

Good to see it works for you :)

written 10 months ago by Philipp Bayer

Why not just

blastx -num_threads 24 (...)

?

written 10 months ago by Pierre Lindenbaum

Shouldn't it be in single quotes, as in:

 -outfmt '6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore'

Also, why not simply use threads as Pierre suggested?

modified 10 months ago • written 10 months ago by Rohit
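
On the quoting question: in bash, single and double quotes yield the same argument for this format string, because it contains nothing that the shell expands inside double quotes. What matters under GNU parallel is that the string reaches blastx as one argument, which is what the -q (--quote) option in the original command is for. The sketch below only demonstrates the shell side; count_args is a hypothetical helper and the format string is shortened for readability.

```shell
# count_args: print how many arguments were received (hypothetical helper).
count_args() { echo "$#"; }

single='6 qseqid sseqid stitle'
double="6 qseqid sseqid stitle"

# Both quote styles produce the same string.
[ "$single" = "$double" ] && echo "identical"

# Quoted: the whole format specifier is one argument after -outfmt.
count_args -outfmt "$double"

# Unquoted: bash splits on whitespace, so the specifier falls apart into
# separate words. This is the kind of splitting that -q guards against.
count_args -outfmt $double
```

In bash the quoted call reports 2 arguments and the unquoted call reports 5.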

Not all stages of BLAST are parallelized, so OP's approach has the potential to be faster. I/O might be an issue with 24 jobs running at once, though.

written 10 months ago by 5heikki
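
One caveat for the parallel approach: with --pipe, GNU parallel cuts the input into chunks of roughly --block bytes and hands each chunk to one job, so a 100M block only keeps 24 jobs busy if the FASTA file is well over 2 GB. A rough sketch for estimating the number of chunks is shown below; approx_chunks is a hypothetical helper, and the result is only approximate because records are never split in the middle.

```shell
# approx_chunks FILE BLOCK_BYTES: rough number of chunks that
# `parallel --pipe --block BLOCK_BYTES` would cut FILE into.
# (hypothetical helper; ignores that records are kept whole)
approx_chunks() {
    local size
    size=$(wc -c < "$1")
    # Ceiling division: a partial last block still counts as one chunk.
    echo $(( (size + $2 - 1) / $2 ))
}

# Example use (file name is a placeholder):
#   approx_chunks MyTranscriptome.fasta $((100 * 1024 * 1024))
```

If the estimate is below the value given to -j, a smaller --block gives every core something to do and also makes finished results appear sooner.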

I had a problem before with heavy concurrent I/O while using BLAST, though I thought it was a system-specific problem :\

written 10 months ago by Rohit

If you are using a job scheduler on a cluster to manage these jobs, then there is no need for, or advantage in, using parallel.

written 10 months ago by genomax

pablo61991 wrote, 10 months ago:

I read a topic in this forum discussing how BLAST uses multi-core resources with this option:

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

For a short search this may not make a big difference, but for a search against nr on shared resources I need to optimize my search as much as I can. For example:

http://www.ettemalab.org/using-for-loop-vs-gnu-parallel-for-blast/

The author reduced the time needed by 1/10, and a search against nr could take >20 days...

Thank you all for your reply.

written 10 months ago by pablo61991
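
For a run that may take days, a rough progress figure can be derived from the output file itself. The sketch below counts the distinct query IDs in the first column of the tabular output and compares them with the number of FASTA headers. blast_progress is a hypothetical helper, and it assumes qseqid is the first output column, as in the command from the question. The figure is a lower bound: queries without hits never appear in the table, and with GNU parallel's default grouping only finished jobs have written their results.

```shell
# blast_progress FASTA TABULAR: percentage of query sequences that already
# have at least one hit in a tabular file whose first column is qseqid.
# (hypothetical helper, not part of BLAST or GNU parallel)
blast_progress() {
    local total done_n
    total=$(grep -c '^>' "$1")                  # number of query sequences
    done_n=$(cut -f1 "$2" | sort -u | wc -l)    # distinct queries with hits
    awk -v d="$done_n" -v t="$total" \
        'BEGIN { if (t > 0) printf "%.1f\n", 100 * d / t; else print "0.0" }'
}

# Example use (file names taken from the question):
#   blast_progress MyTranscriptome.fasta MyTranscriptome_vs_nr.outfmt.6
```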