Process 50k+ sequences with Blastx for Blast2Go
8.6 years ago
satshil.r ▴ 50

Hey all,

I'm trying to process a large transcriptome fasta file with over 50k sequences (just a bit over 53k). I set up a local blast+ instance and the latest nr database (up to nr.25). I started blastx with the following command: blastx -query fasta.fa -out blastx.xml -outfmt 5 -eval 1e-3 -num_threads 32

So far it's processed only 3500 sequences over 2 days. It's a fairly decent workstation, 2x Xeon E2560 V2's with 128GB of RAM. From our previous experience this shouldn't take more than a few days, but at this rate it looks like it's going to take much longer. The output is also quite large for only 3500 sequences: it's already at 1.5GB.

How can I optimize blastx for importing into blast2go? I'm currently reading up on how to parallelize blastx, but I'm not sure if there are better options out there.

Thanks!

I started blastx with the following command: blastx -query fasta.fa -out blastx.xml -outfmt 5 -eval 1e-3 -num_threads 32

How is that possible? You didn't even define a db. Also, it would make sense to opt for refseq_protein over nr, since the non-RefSeq sequences in nr probably can't be linked to GO terms anyway (I could be wrong).

Sorry, I forgot to include that command in this post. I used the nr database in the commands.

8.6 years ago
rtliu ★ 2.2k

My suggestions, based on page 10 of the Blast2GO manual:

1. use -max_target_seqs 10
2. use -word_size 5 (the default of 3 is more sensitive but slower)
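
To make the two suggestions concrete, here is a minimal sketch that assembles the full command with them added. The helper name build_blastx_command is made up for illustration, the file names and the nr database come from the question, and the E-value flag is written with its documented spelling, -evalue. It only builds the argument list; nothing is run.

```python
def build_blastx_command(query, db, out, threads=32, evalue="1e-3",
                         max_target_seqs=10, word_size=5):
    """Return the blastx argument list (hypothetical helper, not part of BLAST+).

    Pass the result to subprocess.run(cmd, check=True) to actually start the search.
    """
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "5",                             # XML output
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-max_target_seqs", str(max_target_seqs),   # suggestion 1
        "-word_size", str(word_size),               # suggestion 2
    ]


cmd = build_blastx_command("fasta.fa", "nr", "blastx.xml")
print(" ".join(cmd))
```

Keeping only 10 targets per query should also shrink the XML output considerably compared with the default.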

It is a big task. Be patient, or find access to a cluster like TACC (https://wikis.utexas.edu/display/bioiteam/split_blast).
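
If you do get cluster access, the usual approach is to split the query file and search each piece as its own job. A rough sketch of the splitting step in plain Python is below. This is not the TACC split_blast tool itself; the function names and the chunk_NNN.fa naming are assumptions for illustration.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, n_chunks, prefix="chunk"):
    """Distribute records round-robin over n_chunks files and return their paths."""
    paths = [f"{prefix}_{i:03d}.fa" for i in range(n_chunks)]
    outs = [open(p, "w") for p in paths]
    try:
        with open(path) as handle:
            for i, (header, seq) in enumerate(read_fasta(handle)):
                # Each sequence is written on a single line after its header.
                outs[i % n_chunks].write(f"{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()
    return paths
```

For example, split_fasta("fasta.fa", 16) would write chunk_000.fa through chunk_015.fa, and each chunk can then be searched with its own blastx job. Round-robin assignment keeps the chunks roughly the same size in number of sequences, though not necessarily in run time.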

Thanks for the help!

How long do you expect it should take to finish a job like this?

Hard to say, maybe 5-10 weeks if the workstation is dedicated to the blastx search.
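
For a rough lower bound, you can extrapolate from the throughput reported in the question (3500 sequences in 2 days). The sketch below assumes the rate stays constant and takes 53,000 as the total ("a bit over 53k"); both are simplifications, since run time depends on sequence length and on how many hits each query produces, so the real figure may well land closer to the range given above.

```python
def estimate_remaining_days(total_seqs, done_seqs, elapsed_days):
    """Linear extrapolation: assumes the observed sequences/day rate holds."""
    rate = done_seqs / elapsed_days          # sequences per day so far
    return (total_seqs - done_seqs) / rate   # days still needed at that rate


# Numbers from the question; 53_000 is an approximation of "a bit over 53k".
remaining = estimate_remaining_days(53_000, 3_500, 2)
print(f"{remaining:.1f} days remaining (~{remaining / 7:.1f} weeks)")
```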