Trying to use GNU parallel with BLAST
1
1
4.9 years ago
pablo61991 ▴ 100

Dear Biostar community,

I am having trouble running blastx against the nr database. To speed things up, I read some posts that mention GNU parallel as a way to make better use of multiple CPUs with ncbi-blast (local mode). Based on several posts in this forum, I adapted my code until I stopped getting error messages. However, I think the program does not understand me when I try to customize the default outfmt 6.

Here is the chunk of code:

module load gcc

cat MyTranscriptome.fasta | parallel -q -j 24 --block 100M --recstart '>' --pipe blastx -db /blastdb/nr -num_threads 1 -evalue 1e-5 -outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 10 -max_hsps 1 > MyTranscriptome_vs_nr.outfmt.6


Before coming here to ask, I searched the posts related to this topic in this forum (and others), but I could not find out how to fix the problem.

Just to clarify (edit): the code ran without problems the classic way (setting -num_threads to my maximum number of cores), so the base command works fine. My problem is the GNU parallel implementation.
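
One thing that may be worth checking with --pipe is how many blocks the input actually splits into, since each block of roughly --block bytes becomes one job. A minimal sketch, assuming the block size is given in bytes; n_blocks is a made-up helper name, not part of GNU parallel or BLAST:

```shell
# Estimate how many chunks `parallel --pipe --block SIZE` would roughly produce.
# n_blocks is a hypothetical helper for illustration only.
n_blocks() {
    # $1 = input file, $2 = block size in bytes
    size=$(wc -c < "$1")
    # Integer ceiling of size / block size
    echo $(( (size + $2 - 1) / $2 ))
}

# Example (file name from the question; 100M treated as 100 * 1024 * 1024 bytes):
# n_blocks MyTranscriptome.fasta $((100 * 1024 * 1024))
```

If the result is lower than the value given to -j, fewer jobs than requested can run at once, so a smaller --block might be worth trying.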

blast parallel gnu parallel blastx optimization • 3.4k views
1

I just tried it with some made-up test data and it works fine for me: I get the output that I expected. What is the error you're seeing?

1

Oh, shame on me... I didn't try it on a subset of the data, and maybe it only shows output once the blast has finished. I had run this command for 12 h, hoping to work out what percentage of the transcriptome had been blasted and to estimate how much time I would need. However, my output was an empty file, which put me into alert mode.
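
A rough way to check progress from the output file, as a sketch: compare the number of distinct query IDs in column 1 of the tabular output with the number of FASTA headers in the input. Two caveats: queries without any hit never appear in outfmt 6 output, and GNU parallel by default holds back each job's output until that job has finished, so with large blocks the file can stay empty for a long time. blast_progress is a made-up helper name:

```shell
# Report "queries with at least one hit / total queries".
# blast_progress is a hypothetical helper for illustration only.
blast_progress() {
    # $1 = query FASTA, $2 = tab-separated BLAST output with qseqid in column 1
    total=$(grep -c '^>' "$1")
    hits=$(awk -F'\t' '!seen[$1]++ { n++ } END { print n + 0 }' "$2")
    echo "$hits/$total"
}

# Example with the file names from the question:
# blast_progress MyTranscriptome.fasta MyTranscriptome_vs_nr.outfmt.6
```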

If it runs properly over a test dataset, the command is fine.

Thank you for your time, and sorry that I forgot to run a test file before panicking.
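
For a quick test like the one mentioned above, a small subset can be cut from the front of the FASTA file. A minimal sketch; fasta_head and the output file name subset.fasta are made up for illustration:

```shell
# Print the first N records of a FASTA file.
# fasta_head is a hypothetical helper for illustration only.
fasta_head() {
    # $1 = number of records to keep, $2 = FASTA file
    awk -v n="$1" '/^>/ { seen++ } seen > n { exit } { print }' "$2"
}

# Example: keep 100 sequences, then run the same parallel/blastx command on the subset.
# fasta_head 100 MyTranscriptome.fasta > subset.fasta
```

Running the full command on such a subset should show within minutes whether the custom outfmt columns come out as expected.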

0

Good to see it works for you :)

0

why not just

blastx -num_threads 24 (...)


?

0

Shouldn't it be in single quotes, as in:

 -outfmt '6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore'


Also, why not simply use threads as Pierre suggested?
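
On the quoting question: in a plain shell, single and double quotes both hand the format string to the program as one argument, so either form should work as long as the string contains nothing the shell would expand. The sketch below shows this with a made-up count_args function; it does not call blastx or parallel. In the original command, the -q option of parallel is what keeps that quoting intact when parallel builds the command line.

```shell
# Print how many arguments the shell passes after quote processing.
# count_args is a hypothetical helper for illustration only.
count_args() {
    echo "$#"
}

count_args -outfmt "6 qseqid sseqid stitle"   # prints 2
count_args -outfmt '6 qseqid sseqid stitle'   # prints 2
count_args -outfmt 6 qseqid sseqid stitle     # prints 5 (unquoted: split into words)
```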

0

Not all stages of blast are parallelized, so the OP's approach has the potential to be faster. IO might be an issue with 24 threads, though.

0

I have had a problem before with multiple concurrent IO while using blast; I thought it was a system-specific problem, though :\

0

If you are using a job scheduler on a cluster to manage these jobs, then there is no need for, or advantage in, using parallel.

4
4.9 years ago
pablo61991 ▴ 100

I read a topic in this forum discussing how blast uses multi-core resources with this option (link here):

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

For a short search this may not make a big difference, but for a search against nr on shared resources I need to optimize my search as much as I can. For example:

http://www.ettemalab.org/using-for-loop-vs-gnu-parallel-for-blast/

The author reduced the time needed by 1/10, and a search against nr could take >20 days...

0

Thank you, Pablo, for sharing these resources. Parallelizing seems promising for someone like me who needs to reduce blastx processing time.

I still have some doubts about the use of the -j option. If I understood correctly, on the Ettemalab website the -j option is used to run blast jobs in parallel, while in the other Biostars discussion they suggested using -j to run jobs serially.

Moreover, how can I best choose how many blast jobs to run? Why did you choose -j 24 in your first post?
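
On -j: in GNU parallel, -j N means "run up to N jobs at the same time", so -j 1 runs them one after another and -j 24 runs up to 24 at once. A common rule of thumb, offered here as an assumption and not as the OP's stated reasoning, is to keep jobs × threads per job at or below the number of available cores; -j 24 with -num_threads 1 would fit a 24-core machine. A minimal sketch, with plan_jobs as a made-up helper:

```shell
# Suggest a -j value so that jobs * threads_per_job does not exceed the core count.
# plan_jobs is a hypothetical helper for illustration only.
plan_jobs() {
    # $1 = cores available (e.g. from `nproc` on Linux), $2 = threads per blast job
    n=$(( $1 / $2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

plan_jobs 24 1   # prints 24
plan_jobs 24 4   # prints 6
```

As noted in the comments above, disk IO on the database may become the limit before the CPU count does, so timing a small subset with a few different values is probably the safest way to choose.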