Question: Trying to use GNU parallel with BLAST
pablo61991 wrote, 5 months ago:

Dear Biostar community,

I am having trouble running blastx against the nr database. To speed things up, I read some posts that mention GNU parallel as a way to make better use of multiple CPUs with ncbi-blast (local mode). Based on several posts in this forum, I adapted my code until I stopped getting errors. However, I think the program is not understanding me when I try to customize the default outfmt 6.

Here is the chunk of code:

module load gcc
module load ncbi-blast
module load parallel

cat MyTranscriptome.fasta | parallel -q -j 24 --block 100M --recstart '>' --pipe blastx -db /blastdb/nr -num_threads 1 -evalue 1e-5 -outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 10 -max_hsps 1 > MyTranscriptome_vs_nr.outfmt.6

Before coming here to ask, I searched the posts related to this topic in this forum (and others), but I could not find how to fix the problem.

JUST to clarify (edit): the code ran without problems the classic way (setting -num_threads to my maximum number of cores), so the base command works fine. My problem is the GNU parallel implementation.

Thank you for your time!

Modified 5 months ago • written 5 months ago by pablo61991

I just tried it with some made-up test data and it works fine for me: I get the output that I expected. What is the error you're seeing?
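For anyone who wants to repeat this kind of check, a small test set can be cut from the real input rather than made up. The following is only a sketch: `fasta_head` is a hypothetical helper name (not part of BLAST or GNU parallel), and it assumes a plain FASTA file in which every record starts with a `>` header line.

```shell
# Hypothetical helper: print the first N records of a FASTA file.
# Usage: fasta_head N file.fasta
fasta_head() {
    awk -v n="$1" '
        /^>/ { if (++count > n) exit }   # stop at the header of record N+1
        { print }                        # copy header and sequence lines
    ' "$2"
}
```

Something like `fasta_head 100 MyTranscriptome.fasta > subset.fasta` could then be piped through the same parallel command, with a smaller `--block` so that the subset is actually split into several jobs.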

Reply written 5 months ago by Philipp Bayer

Oh, shame on me... I didn't try it on a subset of the data, and maybe it only shows output when blast has finished. I mean, I ran this command for 12 h to try to calculate the percentage of the transcriptome blasted and estimate how much time I need. However, my output was an empty file. That put me in alert mode.

If it runs properly on a test dataset, the command is fine.

Thank you for your time, and sorry that I panicked and forgot to run a test file first.
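On the empty output file: by default GNU parallel holds back each job's output until that job has finished, so with large blocks nothing may be written for a long time. Once results do start arriving, the progress estimate described above could be sketched as follows. `blast_progress` is a hypothetical helper, it assumes tabular output with the query ID in the first column (as in the outfmt string used here), and queries without any hit never appear in the output, so the figure is only a rough lower bound.

```shell
# Hypothetical helper: percentage of query sequences that already have at
# least one hit in a tabular (outfmt 6 style) result file.
# Usage: blast_progress queries.fasta results.tsv
blast_progress() (
    total=$(grep -c '^>' "$1")                    # number of query records
    seen=$(cut -f1 "$2" | sort -u | wc -l)        # distinct query IDs with hits
    awk -v s="$seen" -v t="$total" 'BEGIN {
        if (t > 0) printf "%.1f\n", 100 * s / t
        else print "0.0"
    }'
)
```

For example, `blast_progress MyTranscriptome.fasta MyTranscriptome_vs_nr.outfmt.6` would print the percentage of transcripts seen so far.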

Reply written 5 months ago by pablo61991

Good to see it works for you :)

Reply written 5 months ago by Philipp Bayer

Why not just

blastx -num_threads 24 (...)

?

Reply written 5 months ago by Pierre Lindenbaum

Shouldn't it be with single quotes, as in:

 -outfmt '6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore'

Also, why not simply use threads as Pierre suggested?
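On the quoting question: in the shell itself, single and double quotes behave the same for this string, because it contains nothing the shell would expand. What matters is that the whole format string reaches blastx as one argument. A minimal sketch, where `count_args` is a hypothetical helper used only to show how many arguments a command receives:

```shell
# Hypothetical helper: report how many arguments a command would receive.
count_args() {
    echo "$#"
}

count_args -outfmt "6 qseqid sseqid stitle"   # double quotes: prints 2
count_args -outfmt '6 qseqid sseqid stitle'   # single quotes: prints 2
count_args -outfmt 6 qseqid sseqid stitle     # unquoted:      prints 5
```

As far as I understand it, GNU parallel passes the command through a shell a second time, which is why the original command uses `-q`: it keeps the quoted format string together as a single argument.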

Reply modified 5 months ago • written 5 months ago by Rohit

Not all stages of blast are parallelized, so OP's approach has the potential to be faster. I/O might be an issue with 24 threads, though.

Reply written 5 months ago by 5heikki

I had a problem before with multiple I/O while using blast; I thought it was a system-specific problem, though :\

Reply written 5 months ago by Rohit

If you are using a job scheduler on a cluster to manage these jobs, then there is no need for, and no advantage in, using parallel.
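With a scheduler, the splitting that `--pipe --recstart '>'` does on the fly has to be done up front, so that each chunk can be submitted as its own job. A rough sketch, assuming a well-formed FASTA file whose first line is a header; `split_fasta` and the chunk file naming are hypothetical, not part of BLAST or any scheduler.

```shell
# Hypothetical helper: distribute FASTA records round-robin over N chunk
# files named PREFIX.001.fasta, PREFIX.002.fasta, ...
# Usage: split_fasta input.fasta N PREFIX
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ { file = sprintf("%s.%03d.fasta", prefix, (count++ % n) + 1) }
        { print > file }   # sequence lines follow their header into the same chunk
    ' "$1"
}
```

Each chunk can then be given to a separate scheduler job that runs blastx on it with `-query`, and the per-chunk outputs concatenated at the end.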

Reply written 5 months ago by genomax
pablo61991 wrote, 5 months ago:

I read a topic in this forum discussing how blast uses multi-core resources with this option (link here):

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

For a short search this may not make a big difference, but for a search against nr on shared resources I need to optimize my search as much as I can. For example:

http://www.ettemalab.org/using-for-loop-vs-gnu-parallel-for-blast/

The author reduced the time needed by 1/10, and a search against nr could take >20 days...

Thank you all for your reply.

Written 5 months ago by pablo61991