Question: blastn much slower on longer subject sequences?
5 weeks ago, liorglic wrote:

I encountered a strange issue running BLAST (the actual command, shown below, is tblastn). I use the same set of query sequences in two scenarios:
1. Running against a DB of genomic contigs
2. Running against the same contigs, after they have been scaffolded into pseudomolecules
The second run is about 100x slower!
I should note that the pseudomolecules are pretty large - this is a plant genome with chromosomes each over 600 Mbp.

Does this even make sense? Why would BLAST be slower when the sequences in the DB are longer? And is there any way I can improve performance?
I'd have just switched to BLAT or DIAMOND, but this BLAST run is invoked by BUSCO, so I don't really have a choice. The command run by BUSCO looks like:
tblastn -evalue 0.001 -num_threads 40 -query ancestral.fasta -db scaffolds.fasta -out tblastn.tsv -outfmt 7



Don't you think the size of the search space should scale up the search time?

(reply written 5 weeks ago by karl.stamm)

That's exactly the point - the search space size is the same, it's only arranged into fewer but larger sequences. Or maybe I didn't understand what you tried to say...

(reply written 5 weeks ago by liorglic, modified 5 weeks ago)
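
The claim that both databases hold the same amount of sequence can be checked directly. The following is a minimal sketch, not anything BLAST or BUSCO provides: `fasta_stats` is a hypothetical helper that counts records, total bases, and non-N bases in a FASTA stream. Counting non-N bases separately is an assumption about the data, since scaffolders commonly join contigs with runs of N. The two in-memory records are made up for illustration.

```python
import io


def fasta_stats(handle):
    """Return (number of records, total bases, non-N bases) for a FASTA stream."""
    n_seqs = total = non_n = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_seqs += 1  # header line starts a new record
        else:
            total += len(line)
            non_n += len(line) - line.upper().count("N")
    return n_seqs, total, non_n


# Toy data (made up): two contigs, then the same contigs joined by a gap of Ns.
contigs = io.StringIO(">c1\nACGTACGT\n>c2\nGGCC\n")
scaffold = io.StringIO(">chr1\nACGTACGT\nNNNNN\nGGCC\n")

print(fasta_stats(contigs))   # (2, 12, 12)
print(fasta_stats(scaffold))  # (1, 17, 12)
```

On real data the same function can be given an open file, for example `with open("scaffolds.fasta") as fh: print(fasta_stats(fh))`, and likewise for the contig file (whose name is not given in the thread). If the non-N counts agree, the two databases contain the same sequence and differ only in how it is arranged.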

I suppose the search space is much wider than you think. Consider searching two 10-base sequences vs. one 20-base sequence. If you're sliding a 9-base window, the first case (2x10) has just four placements (two per sequence), while the 1x20 'database' has twelve (20 - 9 + 1) ways to place the 9-base query. That's three times more searching, and we haven't even considered reverse complements, partial matches, or split matches. Allowing for gaps in the match (TopHat style) will blow this factor up further. And I'm still not sure your two databases really have the same total base count: 'scaffolded into molecules' doesn't mean one-to-one, and the result could easily contain duplications or repeats.

(reply written 5 weeks ago by karl.stamm)
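
The window counting in the reply above can be reproduced with a short sketch. It only counts ungapped, forward-strand placements of a fixed-length window, which is a simplification and not a description of how BLAST seeds or extends hits. The function name `window_placements` and the contig sizes in the second half are hypothetical, chosen only for illustration.

```python
def window_placements(seq_lengths, k):
    """Count ungapped placements of a length-k window over all sequences."""
    return sum(max(n - k + 1, 0) for n in seq_lengths)


k = 9  # window length used in the reply

# The toy case from the reply: two 10-base sequences vs. one 20-base sequence.
print(window_placements([10, 10], k))  # 4
print(window_placements([20], k))      # 12

# Hypothetical genome-scale case: 6,000 contigs of 100 kbp each,
# versus the same bases joined end to end into one sequence.
contigs = [100_000] * 6_000
merged = [sum(contigs)]

a = window_placements(contigs, k)
b = window_placements(merged, k)

# Joining m sequences adds exactly (m - 1) * (k - 1) placements,
# one set of junction-spanning windows per join.
print(b - a, (len(contigs) - 1) * (k - 1))
print(b / a)  # ratio of placements, merged vs. separate
```

Under these assumptions the threefold gap is specific to sequences barely longer than the window. For sequences much longer than `k`, the extra placements come only from windows spanning the joins, so the ratio stays close to 1.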

I see. That indeed makes sense, but it's still a bit surprising how bad the effect is. Running ~1600 queries against a genome about 50% larger than the human genome is taking more than two days using 40 CPUs. I still suspect that there's something else going on there. I'll update if I ever find out what.

(reply written 5 weeks ago by liorglic)
5 weeks ago, JC wrote:

My guess is that you are expanding the search space by using larger sequences; this can change the way BLAST identifies the initial hits and tries to extend them. It could also be memory: if you are on a machine with limited memory, loading larger sequences increases memory use, and your system could be swapping to disk to fit them.


I don't think there's a memory issue - the job is limited to 100 GB of RAM, and it's actually using 38 GB. In general, it seems like the machine is doing fine - lots of free RAM, no swapping, and in fact it looks like most of the time BLAST is only using one CPU (?).
Regarding your other suggestion - is there anything I can do about it? Are there parameters that control it?

(reply written 5 weeks ago by liorglic)

On the CPU usage: the majority of the BLAST process is actually single-threaded; only a very small part is effectively multi-threaded.

(reply written 5 weeks ago by lieven.sterck)
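
Why a mostly single-threaded run gains little from 40 threads can be illustrated with Amdahl's law. This is a generic sketch, not a measurement of BLAST: the parallel fractions below are made-up values, since the thread does not say what share of the run is actually multi-threaded.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Ideal speedup when only `parallel_fraction` of the work uses all threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


threads = 40  # matches -num_threads 40 in the command from the question

# Hypothetical parallel fractions, for illustration only.
for p in (0.1, 0.5, 0.9):
    print(f"parallel fraction {p:.1f}: speedup {amdahl_speedup(p, threads):.2f}x")
```

Under this model, even if half of the work were parallel, 40 threads would give less than a 2x speedup, which is consistent with the observation that the job mostly keeps a single CPU busy.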