Optimum settings for local blastp for ~10K sequences

Hi all,

I'm trying to blast around 10,000 protein sequences against nr with blastp. In the past, splitting the input into 100-sequence chunks and giving each chunk a single CPU worked well for blastn, but blastp seems to be much slower: a .fasta file with 100 sequences running on a single core has produced no output after 55 minutes.

I have BLAST+ installed in an HPC environment, with the datasets downloaded and indexed appropriately. I have tried blasting only one sequence using 16 cores:

blastp -query sequence.fasta -db nr -out test -outfmt 7 -num_threads 16

and it took around 10 minutes. The same sequence takes about a minute to process on the BLAST web server. I know it should go faster (per sequence) if I blast multiple sequences at once. Is there a way to figure out the optimum ratio of sequences per chunk to cores per job, other than trial and error? I have access to 1000 CPUs at once, so it would be nice to find a decent balance.
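For reference, this is roughly the chunked setup I have been using, sketched below (file names are illustrative, and the SLURM submission is an assumption about the scheduler; chunk size and -num_threads are exactly the knobs I am unsure about):

# Split the full query set into 100-sequence chunks (plain awk, no extra tools needed)
awk '/^>/ { if (n % 100 == 0) out = sprintf("chunk_%04d.fasta", n / 100); n++ } { print > out }' all_proteins.fasta

# Submit one blastp job per chunk (SLURM shown; adjust for your scheduler)
# --cpus-per-task and -num_threads are the parameters I would like to tune
for chunk in chunk_*.fasta; do
    sbatch --cpus-per-task=1 --wrap="blastp -query $chunk -db nr -out ${chunk%.fasta}.out -outfmt 7 -num_threads 1"
done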

Also, why is the web server much faster? Does it bundle together multiple queries or something? Or does our local blast setup potentially suffer from disk I/O issues?

blast blastp hpc

GenoMax 141k

It is possible that NCBI keeps the entire nr index in memory and searches against that, so there is no disk I/O involved. NCBI now also uses an MMseqs2-clustered version of nr as an option to speed up searches on the web. These clustered sequences are not yet available for local download.

You could try creating a RAM disk of 500-600 GB (if you have access to that much memory) and running against that. Even with a RAM disk of that size and 1000 cores, your local storage is likely to become the bottleneck at some point.
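A minimal sketch of that idea, assuming root access on the node (paths and sizes are illustrative; tmpfs lives in RAM, so the mount must be larger than the database):

# Create a tmpfs-backed RAM disk sized to hold the nr database (requires root)
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=600G tmpfs /mnt/ramdisk

# Copy the database files into RAM and point BLAST+ at them via BLASTDB
cp /data/blastdb/nr.* /mnt/ramdisk/
export BLASTDB=/mnt/ramdisk

# Searches now read the database from memory rather than local disk
blastp -query sequence.fasta -db nr -out test -outfmt 7 -num_threads 16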

That said, use DIAMOND instead (it is a much faster protein alignment tool): https://github.com/bbuchfink/diamond

As long as you use a recent version, it can use the pre-made nr indexes.
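Something along these lines should work (paths illustrative; prepdb only needs to be run once per database):

# One-time step: make the pre-formatted BLAST nr database readable by DIAMOND
diamond prepdb -d /data/blastdb/nr

# Then search it directly; -p sets the thread count
diamond blastp -q sequences.fasta -d /data/blastdb/nr -o results.tsv --outfmt 6 -p 16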

Thank you for the explanation! I managed to get DIAMOND to work, but I'm having trouble getting it to run faster than BLAST. For a single test sequence, local blastp takes about 5 minutes, while diamond blastp took a little over 15 minutes. Running the same sequence on the BLAST web server took 10 seconds or so. So I guess now the question becomes: how do I optimize the ratio of sequences per file to CPUs for DIAMOND...

I have a file with 100 sequences running through DIAMOND now, but I'll have to wait a while longer to see how long it takes. For comparison, blastp managed ~40 sequences in 16 hours on a single CPU.

Running the same sequence on the blast web server took 10 seconds

Don't try to compare anything local with NCBI's web blast infrastructure, for you will always come up short :-)

Sounds like you are lucky to have reasonably adequate hardware. DIAMOND actually works well with large numbers of sequences, so put all 10,000 in one query. As the DIAMOND documentation puts it:

DIAMOND is optimized for large input files of >1 million proteins. Naturally the tool can be used for smaller files as well, but the algorithm will not reach its full efficiency.
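So for your case a single run over the whole file is the way to go; something like this (file names and thread count illustrative; DIAMOND batches the queries internally):

# One DIAMOND run over all 10,000 queries on a single fat node
diamond blastp -q all_proteins.fasta -d nr -o all_hits.tsv --outfmt 6 -p 64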

There are some additional notes for distributed computing as well: https://github.com/bbuchfink/diamond/wiki/6.-Distributed-computing

Thank you for the info! It turned out that running all 10k at once is actually faster than running one or two at a time!
