Question

Using bowtie2 in parallel


9.2 years ago

senzasord • 0

We're trying to use bowtie2 to find exact matches to short DNA sequences in a complete genome. We may search for hundreds of thousands of short sequences at a time. At first, it seemed like the way to do this was to spawn a bunch of threads and run lots of separate queries in parallel. However, we're finding that on a 30-core machine hitting a single index on a local disk, using more than 3 threads results in a significant slowdown.

We are using the '--mm' option, which according to the manual tells bowtie2 to use memory-mapped I/O so that many bowtie2 processes can share the index. Used interactively for a single query, --mm resulted in a noticeable speedup. However, I'm wondering if we're running into a problem where the shared, memory-mapped I/O requires some mutex coordination, which causes things to bog down when the index is hit by multiple threads. In that case, could we increase throughput by taking a hit on each individual query but utilizing our full 30 cores?
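For reference, the one-process-per-query scheme described above might look roughly like the following Python sketch. It is an illustration under assumptions, not our actual code: the function names, the `runner` parameter, and the worker count are hypothetical, and it relies only on the documented bowtie2 options `--mm`, `-x`, `-c` (read sequence given on the command line) and `-U`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def single_query_cmd(index_prefix, sequence):
    """Build a bowtie2 command that aligns one sequence passed on the command line."""
    # --mm: memory-map the index so concurrent processes can share it.
    # -c:   treat the -U argument as a literal sequence, not a file name.
    return ["bowtie2", "--mm", "-x", index_prefix, "-c", "-U", sequence]


def run_query(index_prefix, sequence, runner=subprocess.run):
    """Run one bowtie2 process and return its SAM output (written to stdout)."""
    result = runner(
        single_query_cmd(index_prefix, sequence),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def run_queries(index_prefix, sequences, max_workers=3, runner=subprocess.run):
    """Run one bowtie2 process per sequence, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input sequences.
        return list(pool.map(lambda s: run_query(index_prefix, s, runner), sequences))
```

Each call pays the full process start-up cost, which is why the amount of work per query matters so much in this scheme.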

alignment bowtie2 • 6.5k views


Ram · Accepted Answer · 2015-01-27


9.2 years ago

TriS ★ 4.7k

If I understood your question correctly, have you tried the -p option?

From the bowtie2 manual:

Performance tuning
If your computer has multiple processors/cores, use -p

The -p option causes Bowtie 2 to launch a specified number of parallel search threads. Each thread runs on a different processor/core and all threads find alignments in parallel, increasing alignment throughput by approximately a multiple of the number of threads (though in practice, speedup is somewhat worse than linear).



Yes. In our standard scheme, this doesn't help, because we handle each sequence with a separate query (and therefore a separate heavyweight process), and a single query involves too little work to justify extra threads. We could try handling multiple queries in a single process, in which case '-p' might help, but that works best for batch rather than on-line processing, and we need to be able to do both quickly.

Reply by senzasord • 9.2 years ago
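The batch variant mentioned in the reply above (many queries in one process, with '-p' doing the parallelism) could be sketched in Python as follows. This is a minimal sketch under assumptions: the function names, file paths, and the `q0, q1, ...` read names are hypothetical. The bowtie2 options used (`--mm`, `-p`, `--reorder`, `-x`, `-f`, `-U`, `-S`) are from the manual; `--reorder` keeps SAM records in input order when `-p` is greater than 1.

```python
import subprocess


def write_fasta(sequences, path):
    """Write the query sequences to a FASTA file, naming them q0, q1, ..."""
    with open(path, "w") as fh:
        for i, seq in enumerate(sequences):
            fh.write(f">q{i}\n{seq}\n")


def bowtie2_batch_cmd(index_prefix, fasta_path, sam_path, threads):
    """Build one bowtie2 command that aligns the whole FASTA file with `threads` threads."""
    return [
        "bowtie2", "--mm",
        "-p", str(threads),   # parallel search threads inside a single process
        "--reorder",          # keep output in the same order as the input reads
        "-x", index_prefix,
        "-f",                 # input reads are FASTA
        "-U", fasta_path,
        "-S", sam_path,
    ]


def align_batch(index_prefix, sequences, fasta_path, sam_path, threads=30):
    """Write all queries to one file and align them in a single bowtie2 run."""
    write_fasta(sequences, fasta_path)
    subprocess.run(bowtie2_batch_cmd(index_prefix, fasta_path, sam_path, threads), check=True)
    return sam_path
```

The index is loaded once and the per-process start-up cost is paid once, so the threads have enough work to share. It does not address the on-line case, where single queries arrive one at a time.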


It's not documented anywhere, but the bowtie2 source code suggests that you should be able to compile it easily enough as a library, so perhaps you can directly integrate it into whatever your current pipeline is that way.

BTW, the other possibility would just be to use a FIFO. Whether this will work will depend on the details of your pipeline.

Reply by Devon Ryan 104k • 9.2 years ago
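One way to read the FIFO suggestion above is to keep a single bowtie2 process alive and feed it queries through a named pipe, so the index is loaded only once. The Python sketch below is a guess at that setup rather than a tested recipe: the function names and the `queries.fifo` file name are hypothetical, `os.mkfifo` is POSIX-only, and how promptly results appear depends on bowtie2's input and output buffering, which is not verified here.

```python
import os
import subprocess
import tempfile


def make_fifo(directory):
    """Create a named pipe inside `directory` and return its path."""
    path = os.path.join(directory, "queries.fifo")
    os.mkfifo(path)
    return path


def bowtie2_fifo_cmd(index_prefix, fifo_path, sam_path, threads=4):
    """Build a bowtie2 command that reads FASTA records from the FIFO."""
    # SAM goes to a file (-S) so that a full stdout pipe cannot stall
    # bowtie2 while this script is still writing queries.
    return [
        "bowtie2", "--mm", "-p", str(threads),
        "-x", index_prefix, "-f", "-U", fifo_path, "-S", sam_path,
    ]


def stream_queries(index_prefix, sequences, sam_path, threads=4):
    """Start one bowtie2 process and stream sequences to it through a FIFO."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo = make_fifo(tmp)
        proc = subprocess.Popen(bowtie2_fifo_cmd(index_prefix, fifo, sam_path, threads))
        # open() blocks until bowtie2 opens the FIFO for reading; if bowtie2
        # exits before doing so, this call will hang.
        with open(fifo, "w") as pipe:
            for i, seq in enumerate(sequences):
                pipe.write(f">q{i}\n{seq}\n")
        # Closing the FIFO signals end of input; wait for bowtie2 to finish.
        return proc.wait()
```

Called as `stream_queries("genome_index", queries, "out.sam")` (with a hypothetical index prefix), this returns bowtie2's exit code once all queries have been consumed.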