Subread average running time

I've recently updated an RNA-seq pipeline based on Subread from 1.4.6 to 1.5.0-p1 on my Linux server. Everything seems fine except the running time: with the older version, a complete subjunc run (with --allJunctions) on a TruSeq human RNA-seq dataset (about 66M reads) took about 2 to 3 hours.

Now it's taking days to run on the same server with the same dataset. I'm using the pre-built binaries for Linux x64. Strangely, parallel support seems better now: it uses up to 40 threads, compared to 20 with the previous version (the machine supports up to 64 threads). For the current test run, I set the thread limit to 50.
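In case it matters, this is basically a stock subjunc invocation. A rough sketch of the command I run (the index name, FASTQ names and output name below are placeholders, not my actual paths):

    # Placeholder index/file names; -T sets the thread count.
    subjunc -i hg38_index \
        -r sample_R1.fastq -R sample_R2.fastq \
        -T 50 --allJunctions \
        -o sample_subjunc.bam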

I've already searched the relevant forums and couldn't determine whether this is an issue with my server or with Subread.

Has anyone tested the --complexIndels option?

RNA-Seq subjunc subread

Well, Wei answered my questions on the Subread mailing list. I think the answers are worth reposting here at Biostars; his replies are quoted below, interleaved with some of my own observations.

When the value provided to the '-I' option is greater than 16, subjunc (and also the subread aligner) will perform read assembly to try to discover long indels, and this will increase the running time of the program.

In the test run cited above, the running time increased from ~2.5 hours to about 10 hours (with 40 threads); for some reason, execution appears to become sequential at a certain point. I'm not sure it is worth the effort, as it found only about 3% more indels and reported slightly fewer junctions and fusions (see the summaries at the end of this post).
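For the record, the only change between the two runs summarized at the end of this post was the indel length limit, i.e. something along these lines (same placeholder names as in the question):

    # -I 17 is above the 16 bp threshold, so the long-indel
    # read-assembly step described above kicks in.
    subjunc -i hg38_index -r sample_R1.fastq -R sample_R2.fastq \
        -T 40 -I 17 --allJunctions -o sample_subjunc_I17.bam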

Turning on the '--complexIndels' option could significantly increase the running time if the error rate in your data is high, so the running time associated with this option is data specific. Subjunc/subread first builds a global indel table and then re-aligns all the reads, taking the indels in the table into account. A high error rate in the input data can result in a large indel table, and the realignment step will then take much longer.

In fact, on the same dataset the job still had not finished after a week of running, so I killed it. It is also not clear to me what exactly counts as a complex indel.
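If anyone wants to gauge what '--complexIndels' would cost on their own data before committing to a multi-day run, one option is to time it on a small subsample first. A minimal sketch (assuming uncompressed FASTQ and the same placeholder names as above; since the global indel table is built from whatever reads are supplied, the timing is only a rough indicator):

    # First 1M read pairs (FASTQ stores 4 lines per read).
    head -n 4000000 sample_R1.fastq > sub_R1.fastq
    head -n 4000000 sample_R2.fastq > sub_R2.fastq
    time subjunc -i hg38_index -r sub_R1.fastq -R sub_R2.fastq \
        -T 40 -I 17 --complexIndels --allJunctions -o sub_complex.bam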

Subjunc/subread has a hard-coded limit on the number of threads used for mapping, and this limit is 40 threads. We can change this limit, but I do not feel the running time will improve much by using more threads than that, because of the I/O cost from each thread.

Best wishes,

Wei

Default parameters with --allJunctions
||          Junctions : 174694                                                ||
||            Fusions : 87827                                                 ||
||             Indels : 184357                                                ||
||       Running time : 163.7 minutes                                         ||

Indels up to 17 bp (-I 17) with --allJunctions
||          Junctions : 173885                                                ||
||            Fusions : 87393                                                 ||
||             Indels : 189873                                                ||
||       Running time : 588.3 minutes                                         ||
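The "~3% more indels" figure quoted above comes straight from these two summaries, e.g.:

    # Relative change in reported indels, and the runtime ratio:
    echo "scale=2; (189873 - 184357) * 100 / 184357" | bc   # 2.99  (% more indels)
    echo "scale=2; 588.3 / 163.7" | bc                      # 3.59  (x longer running time)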