Question

blastp too low cpu usage

8 months ago

biomarco ▴ 50

Hi all,

I'm seeing very low CPU usage with blastp 2.14.0 on a virtual machine with 184 GB of RAM and 32 cores. The BLAST searches I'm running seem to take an unusually long time. This is my typical command line:

blastp -task blastp -db nr -query query_sequence.fasta -num_threads 32 -max_target_seqs 200000 -outfmt 15 -out blast_output.json

I noticed that when the search is launched, multiple blastp threads occupy all 32 cores as expected, but this lasts only a few minutes. After that, a single thread remains and sits at very low CPU load for many hours (just 3-5% average CPU usage on one core). This thread uses up to 85% of the RAM.

Is it normal that the CPU load is so low for hours?

blastp blast • 585 views

score 2 · Answer 1 · 2023-08-15

8 months ago

Mensur Dlakic ★ 27k

My first suspect would be a slow disk, followed by insufficient RAM. Possibly both.
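One rough way to test the RAM hypothesis is to compare the on-disk size of the database files with the machine's total memory. The sketch below is only illustrative: the helper names `db_size_bytes` and `mem_total_bytes` and the `DB_PREFIX` variable are made up for this example, and it assumes a Linux host with GNU `stat` and `/proc/meminfo`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DB_PREFIX are hypothetical, not part of BLAST+.

db_size_bytes() {
    # Sum the sizes of all files that start with the given prefix,
    # e.g. nr.00.psq, nr.00.pin, ... (uses GNU stat).
    local prefix="$1" total=0 f size
    for f in "$prefix".*; do
        [ -e "$f" ] || continue          # skip an unmatched glob
        size=$(stat -c %s "$f")
        total=$((total + size))
    done
    echo "$total"
}

mem_total_bytes() {
    # MemTotal in /proc/meminfo is reported in kB (Linux only).
    local kb
    kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo $((kb * 1024))
}

# Only report when a prefix is supplied, e.g. DB_PREFIX=/data/blastdb/nr
DB_PREFIX="${DB_PREFIX:-}"
if [ -n "$DB_PREFIX" ]; then
    echo "database: $(db_size_bytes "$DB_PREFIX") bytes"
    echo "RAM:      $(mem_total_bytes) bytes"
fi
```

If the database turns out to be larger than the available memory, disk reads are a plausible bottleneck, though this check alone does not prove it.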

These days it is quite wasteful to search against the nr database, unless you absolutely must have every single homologous sequence. The database has grown to almost 400 million sequences. I suggest you search against a database clustered at 90% identity, which will cut the database size in half and still give you most of the relevant homologs. I'm not sure whether NCBI has started distributing nr90 yet, but you can get an equivalent database from UniRef.

https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/
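For illustration, switching to UniRef90 could look roughly like the sketch below. The file name `uniref90.fasta.gz`, the database name `uniref90`, and the `MAX_TARGETS` value are assumptions for this example, not settings from the thread. The script only prints the commands by default; set `DRY_RUN=0` to actually run them (the download and the formatted database are large).

```shell
#!/usr/bin/env bash
# Sketch only: file names, DB_NAME and MAX_TARGETS are assumed values.

UNIREF_URL="https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
DB_NAME="uniref90"
MAX_TARGETS="${MAX_TARGETS:-5000}"   # assumed; tune to your own needs
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Download and unpack the clustered sequences.
run wget -c "$UNIREF_URL"
run gunzip -k uniref90.fasta.gz

# Format them as a protein BLAST database.
run makeblastdb -in uniref90.fasta -dbtype prot -out "$DB_NAME" -title "UniRef90"

# Same search as in the question, pointed at the clustered database.
run blastp -task blastp -db "$DB_NAME" -query query_sequence.fasta \
    -num_threads 32 -max_target_seqs "$MAX_TARGETS" -outfmt 15 -out blast_output.json
```

The dry-run wrapper is just a convenience for checking the commands before committing to a multi-hour download and formatting step.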

Actually, I very often cluster the search results at 90-95% identity, depending on how many hits I get. It does indeed make sense to move to a clustered nr database instead, or at least to switch to it when the entire sequence space of a query can't be explored without raising -max_target_seqs too much (which is basically what slows down the process). I was waiting for the mmseqs-clustered version that NCBI offers through the BLAST web interface (which should be the one you're referring to), but for some reason it's still flagged as "experimental" and isn't available for download.

The only drawback I see with using a clustered nr is the risk of missing some PDB codes, but this can be easily worked around by running a parallel search on the much smaller pdbaa database.
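A minimal sketch of that workaround, assuming both databases are already installed locally under the names `uniref90` and `pdbaa` (as far as I know, pdbaa can be fetched with the `update_blastdb.pl` script shipped with BLAST+). The output file names and the thread split are arbitrary choices for this example, and `run_search` is a hypothetical helper. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch only: database names, output names and thread counts are assumptions.

QUERY="query_sequence.fasta"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

run_search() {
    # Usage: run_search <database> <output file> <threads>
    local db="$1" out="$2" threads="$3"
    run blastp -task blastp -db "$db" -query "$QUERY" \
        -num_threads "$threads" -outfmt 15 -out "$out"
}

# Launch both searches in the background and wait for them to finish.
run_search uniref90 uniref90_hits.json 28 &
run_search pdbaa pdbaa_hits.json 4 &
wait
```

Since pdbaa is much smaller, giving it only a few threads should leave most cores for the clustered search.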

I will try uniref90 out. Many thanks for the hint!

Reply • 8 months ago by biomarco ▴ 50