Is there any fixed limit on the number of threads when running TopHat?
7.7 years ago
tunl ▴ 80

(1) I know that the number of threads specified (the -p option) should not be larger than the number of cores available on the system. But I am wondering whether there is any fixed limit on the number of threads when running TopHat, regardless of how many cores are available?

In other words, if the number of threads reaches a certain high number, could it cause a segmentation fault, or simply fail to speed up the process any further?

(2) We have 24 cores in our system. Is it safe and effective to run TopHat with 20 threads (-p 20)? What about all 24 (-p 24)?

(3) Since we have many large samples to run, I’m trying to figure out a good way to process them in parallel.

Between these two choices: (a) running a single TopHat job with “-p 20” on each sample sequentially, or (b) running two TopHat jobs with “-p 10” simultaneously, which takes less time for the same number of samples?

Would choice (b) be faster, since TopHat has some steps that do not use multithreading (I have pre-built the transcriptome index)?

On the other hand, would choice (b) use double the memory of choice (a), which might slow things down?
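
For concreteness, here is roughly what the two schedules I have in mind would look like (all file and index names below are placeholders for our actual data):

    # (a) one 20-thread job per sample, run sequentially
    for s in sampleA sampleB; do
        tophat -p 20 -o ${s}_out --transcriptome-index=tx_idx/genes genome_index ${s}_1.fastq ${s}_2.fastq
    done

    # (b) two 10-thread jobs run simultaneously
    tophat -p 10 -o sampleA_out --transcriptome-index=tx_idx/genes genome_index sampleA_1.fastq sampleA_2.fastq &
    tophat -p 10 -o sampleB_out --transcriptome-index=tx_idx/genes genome_index sampleB_1.fastq sampleB_2.fastq &
    wait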

Any advice would be greatly appreciated.

Thank you very much!

RNA-Seq Tophat • 3.5k views
You should leave a core or two free so the system can use them for other essential processes. If a program tries to use all available cores, the system may become very sluggish and, at worst, may stop responding to external inputs.

Cores are just one part of the equation. Memory (as noted by @mastal511 below) is an important consideration as well. Finally, you are going to be limited by the throughput of your storage subsystem. On most modern systems the CPUs are rarely the bottleneck; you will find that the cores are generally waiting for data to arrive for them to work on.

Do this: subset a pair of sample files (100K reads), run the possible permutations (10 cores x 2 jobs, 20 cores x 1 job, etc.) on the subsets, and time them to figure out what looks to be the optimal strategy for your system, then go with that for the larger set.
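
A minimal sketch of such a timing test, assuming seqtk is available for subsampling (all file and index names are placeholders):

    # subset 100K read pairs; using the same seed keeps mates in sync
    seqtk sample -s100 sample_1.fastq 100000 > sub_1.fastq
    seqtk sample -s100 sample_2.fastq 100000 > sub_2.fastq

    # 20 threads x 1 job
    time tophat -p 20 -o p20_test genome_index sub_1.fastq sub_2.fastq

    # 10 threads x 2 jobs, run simultaneously
    time ( tophat -p 10 -o p10_a genome_index sub_1.fastq sub_2.fastq &
           tophat -p 10 -o p10_b genome_index sub_1.fastq sub_2.fastq &
           wait )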

Keep in mind that once you reach the resource limits on the weakest link/component, there can be no additional speedup unless you upgrade the hardware.

Thanks a lot for your advice!

Our system has 128 GB of RAM. The Trapnell et al. protocol suggests 16 GB of RAM and uses -p 8 in its examples. So does that mean about 2 GB per thread? If that’s the case, I guess our 20 threads (roughly 40 GB) may be OK?

So “20 threads one job” may end up using similar memory to “10 threads x 2 jobs”, right?

But I guess “20 threads one job” has one set of Bowtie indexes allocated in memory, while “10 threads x 2 jobs” will have two sets allocated, so “10 threads x 2 jobs” may still take more memory than “20 threads one job”?

That is precisely the kind of thing you will discover when you do the small test jobs :-)

You should have enough RAM for these jobs.
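
If you want the actual peak usage rather than an estimate, GNU time on Linux will report it for a test run (a sketch; the tophat arguments are placeholders):

    /usr/bin/time -v tophat -p 20 -o p20_test genome_index sub_1.fastq sub_2.fastq 2> tophat_time.log
    grep "Maximum resident set size" tophat_time.log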

For your attention: TopHat2 has been declared obsolete by its authors; you can use HISAT2 instead.
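
For comparison, a typical HISAT2 run looks something like this (index and file names are placeholders; HISAT2 writes SAM to stdout, so it can be piped straight into samtools):

    # build the index once, then align with 20 threads
    hisat2-build genome.fa genome_index
    hisat2 -p 20 -x genome_index -1 sample_1.fastq -2 sample_2.fastq \
        | samtools sort -@ 4 -o sample.sorted.bam -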

TopHat does not have a fixed limit on the number of threads for its multithreaded phases, but when you have a lot of cores, TopHat's run time becomes dominated by its slow single-threaded phases. So it does not scale as well to large numbers of cores as alternatives such as STAR or BBMap.
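
This is just Amdahl's law: if a fraction p of the run is multithreaded, the best possible speedup on N cores is

    S(N) = 1 / ((1 - p) + p / N)

so if, for illustration, only 70% of a TopHat run were multithreaded (an assumed figure, not a measured one), 20 cores would give at most 1 / (0.3 + 0.7/20) ≈ 3x.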

7.7 years ago
mastal511 ★ 2.1k

The number of threads you can use may also depend on the amount of memory you have available. I recently ran TopHat on a cluster, and the more threads I used, the higher the maximum memory the run required.

Thanks so much for sharing your experience!

In your experience, was the relationship between memory requirement and the number of threads linear?

Do you still remember the memory usages in your run? It'd be really helpful if you could share some stats with us. Thanks a lot!

I think the maximum memory usage was just over 3 GB per thread, so I ended up requesting 3.5 GB per thread. On the cluster I was using we had to request the amount of memory per thread, and yes, the maximum memory used appeared to be more or less linear in the number of threads. I'm sure the memory usage will also depend on the size of your files - the length and number of reads. This particular dataset was 51 bp SE reads, about 30-40 million reads per file.
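
Scaling those figures to the 24-core / 128 GB system described above (rough arithmetic only, since genome size, read length, and read counts all differ):

    20 threads x 3.5 GB/thread ≈ 70 GB   (one -p 20 job)
    2 jobs x 10 threads x 3.5 GB/thread ≈ 70 GB, plus a second copy of the index

Either way it should fit comfortably in 128 GB.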

Thanks a lot for your information!

This is super helpful.

Do you also remember approximately how long TopHat took to run through a sample (and how many threads you used)?

Memory requirements will depend on the size of the genome you are using. Ditto for the run time, since your data files could be smaller or larger than @mastal511's. Finally, running jobs on a cluster as opposed to a single server is an apples-to-oranges comparison.
