BLAST: Is there a difference between splitting queries and using more threads?
7.9 years ago
JV ▴ 440

Hi,

I'm wondering:

If I want to blast a large number of protein sequences against the ncbi-nr database (say for example in order to analyse the species and function composition with MEGAN), which of these options would be more sensible:

A.) to split the queries into subsets and run more jobs in parallel (using fewer threads each)

or

B.) to blast all queries in one job but using more threads

Or doesn't it matter at all which of the two I choose?

I was under the impression that simply using twice as many threads should have almost exactly the same effect as splitting the query data in two subsets and running them in parallel. Is this assumption wrong?

BLAST threads split-queries • 8.6k views

I was under the impression that simply using twice as many threads should have almost exactly the same effect as splitting the query data in two subsets and running them in parallel. Is this assumption wrong?

Typically, threaded programs can share memory space but may contend for resources that only one thread may use at a time (for example, updating shared values), and there may be other overheads associated with the threading.

When running separate processes, memory is not shared and there is no contention between programs (other than that for the overall computational resources). But each program may load a separate copy of the same information.

The exact overheads are typically not easy to estimate, but, for example, running ten blast processes independently will almost certainly use a lot more memory than running one blast process with ten threads.


Yes, the thought that, if I split the data and ran multiple processes, the complete ncbi-nr database would have to be loaded into memory multiple times made me prefer running more threads for one process over running more processes.

It DOES seem that my blast is in fact consistently using all 12 threads that I assigned to it. It usually appears as "sleeping" (even though it is using 1200% cpu) when I look it up with "top", but nonetheless it produces output, so it is running.

However, it is taking so impossibly long (it took over a week to blast 5000 sequences against nr) that I will have to consider splitting the input data and taking up more memory.

7.9 years ago

I got interested and ran some tests. When the target was a database with a single sequence there seemed to be no difference when using more than one thread.

Then searching 10000 short sequences against all bacterial genomes:

• 1 thread - 59 s
• 8 threads - 21 s (2.8 times faster, same memory consumption)
• 8 processes - 27 s (2.1 times faster, 8 times more memory consumption)
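
The "times faster" figures above are simple ratios of wall-clock times. A minimal sketch of that arithmetic follows, assuming the rounded second counts quoted in the list; because those are rounded, the computed ratios may differ slightly from the figures in the post. The helper names `speedup` and `efficiency` are made up for this illustration.

```python
def speedup(baseline_s, parallel_s):
    """How many times faster the parallel run is than the baseline."""
    return baseline_s / parallel_s


def efficiency(baseline_s, parallel_s, workers):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup(baseline_s, parallel_s) / workers


# Wall-clock seconds and worker counts, copied from the list above.
baseline = 59
runs = {
    "8 threads": (21, 8),
    "8 processes": (27, 8),
}

for label, (seconds, workers) in runs.items():
    s = speedup(baseline, seconds)
    e = efficiency(baseline, seconds, workers)
    print(f"{label}: {s:.2f}x faster, {e:.0%} of linear scaling")
```

Dividing the speedup by the worker count makes it easier to see how far both strategies are from ideal scaling in this particular test.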

Any pointers on resources for such benchmarking experiments, Istvan? I am amazed by how fast you did this!


I have simulated reads from a bacterial genome with wgsim then blasted against bacterial nt. For the parallel process I split the files with the unix split command into approximately 8 pieces and ran those with parallel.

I will say, though, that I don't know how universal these observations are. I was just curious: we do use blast to test for contamination of short reads, and this is what I used for the test. There were also other programs running in the background, and that too could impact performance differently when it comes to processes vs threads. Finally, there may be fixed costs associated with starting blast that amortize differently for longer tests.
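
The query-splitting step can also be done per record rather than per line, which avoids cutting a multi-line sequence in half. The following is only a sketch of that idea, not the procedure used in the test above (that used the unix split command); the function names are hypothetical, and it assumes plain FASTA input where every record starts with a ">" header line.

```python
def read_fasta_records(text):
    """Return a list of FASTA records, each a string starting with its header line."""
    records, current = [], []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():  # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists so the chunks end up similar in size."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]  # drop empty chunks


example = ">r1\nACGT\n>r2\nGGCC\nTTAA\n>r3\nAAAA\n"
for i, chunk in enumerate(split_round_robin(read_fasta_records(example), 2)):
    print(f"chunk {i}: {len(chunk)} record(s)")
```

Each chunk can then be written to its own file and handed to a separate blast job. Round-robin assignment is one simple way to keep the pieces balanced when sequence lengths vary along the file.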


Thanks for testing and for sharing!

However, my tests gave me completely different results (see my answer below): MUCH more speed with more processes compared to more threads. Do you think this could be because in your case the size of the reference database was larger than in my case? Also, I did not use GNU parallel, but started the jobs in different subshells using a "for" loop.

Also: how short were your test input sequences, exactly? I'm intrigued because your blasts took a few minutes with 10,000 sequences against nt, while my blast with about 5000 protein sequences against nr is still running after 1 week (with 12 threads, low load on the servers and lots of RAM still available). I'm really wondering what is going wrong with my blasts on our server.


That does not surprise me that much. There are many factors that can influence the results. For example, one could be maxing out disk IO with multiple processes, and then that becomes the limiting factor. Alternatively, long alignments may be more CPU bound, in which case separate processes may make better use of the CPUs.

It is best to measure it for each particular use case and system.

As for the question: reads were 70bp long and were aligned against all bacterial DNA sequences. It was meant to simulate the use case of looking for contamination in an NGS sample of bacterial origin.
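
For measuring on one's own system, a small timing harness is often enough. Below is a minimal sketch, assuming Python 3 is available; `time_command` is a hypothetical helper, and the command being timed is a do-nothing placeholder so the example runs anywhere. It would need to be replaced by the actual blast invocation being compared.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so failed runs are not timed silently.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Placeholder command (assumption): swap in a real blast command line here.
placeholder = [sys.executable, "-c", "pass"]
timings = time_command(placeholder, repeats=3)
print(f"min {min(timings):.3f}s, max {max(timings):.3f}s")
```

Repeating each run matters because the first pass may read the database from disk while later passes hit the operating system's file cache, which can shift the threads-versus-processes comparison.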

7.9 years ago
JV ▴ 440

Well, AFTER posting the question here I found this rather helpful post on this topic (but I DID look beforehand, honestly):

This guy found that, at least when blasting against a relatively small database, adding more threads does not really improve the speed of the blast-job.

In fact, it seems several people suspect that blast+ uses multiple threads for only a small part of the calculation.

So this seems to suggest that it is indeed better to split query data than to increase thread count.


Oh wait, I forgot. blastall is better than blast+ when working with smaller query sequences. And yes, there is a definite bottleneck in the process when it comes to threading - not all phases of blast/blast+ benefit equally from it.


Hmm, but as far as I could gather, blast+ is supposed to be far faster than blastall, which is why I personally would prefer it for running blasts on large datasets. I know the results are not 1:1 comparable to blastall results, but for this reason I'm simply switching to blast+ entirely.


He did not actually measure the time it takes when running blast in parallel ... that is quite the omission IMHO ... it seems that he simply extrapolates, and that does not sound right.


True... It seems that I will simply have to compare multiple processes vs multiple threads myself.

I'll post the results here when I have them (and have the time).

7.9 years ago
JV ▴ 440

I also did some tests now and want to show how these compare.

The query data was 2000 random proteins from the mouse genome.

The blast database was a hash_indexed database of >37000 proteins from the zebrafish genome.

First single processes with different thread counts:

• 1 process, 1 thread: 23m 18s
• 1 process, 2 threads: 18m 43s
• 1 process, 4 threads: 16m 43s
• 1 process, 8 threads: 15m 46s

Now as a comparison, running multiple processes:

• 8 processes, 1 thread each: 3m 45s
• 4 processes, 2 threads each: 4m 31s
• 2 processes, 4 threads each: 6m 59s
• 2 processes, 8 threads each: 7m 4s (<-- huh? more than with 4 threads? probably just run-to-run variation)

So in my case (and with a comparatively small database) running more parallel processes is significantly faster than running just one process with multiple threads. But as Istvan Albert said, these results are probably not directly transferable to all systems. I'd also have to check how this scales when blasting against the huge nr database.
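
One rough way to read the single-process timings above is through Amdahl's law, S = 1 / ((1 - p) + p / n), which relates the speedup S on n threads to the fraction p of the work that actually runs in parallel. This is an interpretive model brought in for illustration, not something measured in the tests, and it rests on single runs. The sketch below solves for p from the reported times; the function names are made up for this example.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' timing to seconds."""
    return 60 * minutes + seconds


def amdahl_parallel_fraction(t1, tn, n):
    """Parallel fraction p implied by Amdahl's law for a run on n threads.

    Rearranged from S = 1 / ((1 - p) + p / n), with S = t1 / tn.
    """
    s = t1 / tn
    return (1 - 1 / s) / (1 - 1 / n)


# Single-process timings copied from the first list above.
single_thread = to_seconds(23, 18)
threaded = {
    2: to_seconds(18, 43),
    4: to_seconds(16, 43),
    8: to_seconds(15, 46),
}

for n, t in threaded.items():
    p = amdahl_parallel_fraction(single_thread, t, n)
    print(f"{n} threads: speedup {single_thread / t:.2f}x, implied parallel fraction {p:.2f}")
```

With these numbers the implied parallel fraction comes out a bit under 0.4 for all three thread counts, which would be consistent with the suspicion raised earlier in the thread that only part of the blast+ run benefits from threading. Under that reading, extra threads quickly stop helping, while independent processes parallelize the whole job.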

7.9 years ago
Ram 37k

I always start off adding more threads. If I have too many sequences, I go for batches of 1000, each with multiple threads.
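
A sketch of what such batching could look like, under stated assumptions: the batch size of 1000 comes from the post, while the file names, the thread count of 4, the database name and the choice of BLAST+ `blastp` with tabular output are placeholders picked for illustration. The code only builds the command lines; it does not run them.

```python
import shlex


def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def blastp_command(query_path, out_path, db="nr", threads=4):
    """Build a BLAST+ blastp command line as a shell-safe string (not executed)."""
    argv = [
        "blastp",
        "-query", query_path,
        "-db", db,
        "-num_threads", str(threads),
        "-outfmt", "6",
        "-out", out_path,
    ]
    return shlex.join(argv)


# Hypothetical query identifiers standing in for real sequences.
query_ids = [f"seq{i}" for i in range(2500)]
for n, batch in enumerate(batches(query_ids, 1000)):
    print(len(batch), blastp_command(f"batch_{n}.fasta", f"batch_{n}.tsv"))
```

Fixed-size batches also make it easy to restart a long search from the last finished batch instead of from the beginning.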

7.9 years ago

In my case, matching 1000000 sequences against a small database of 60 highly homologous sequences, I found that threading through the -p option doesn't provide any speedup, so one has to do query splitting. So, keeping Istvan's comment in mind, I think built-in threading is efficient when query count << database size.

Generally speaking I think that speedup is a function of query count, database size and k-mer count distributions, i.e. homology levels. Unfortunately I'm too lazy to perform such a benchmark :(