3.5 years ago by
São Paulo, Brazil
Well, Wei answered my questions on the subread mailing list. I think the answers are worth sharing here at Biostars. Of course, I've added some of my own observations.
When the value given to the '-I' option is greater than 16, subjunc (and also the subread aligner) will perform read assembly to try to discover long indels, and this will increase the running time of the program.
In the test run cited above (results at the end of this post), the running time increased from ~2.7 hours to about 10 hours (with 40 threads). For some reason, at a certain point execution appears to become sequential. I don't know if it is worth the effort, as it found only about 3% more indels, while junctions and fusions each dropped by about 0.5%.
Turning on the '--complexIndels' option could significantly increase the running time if the error rate in your data is high, so the running time associated with this option is data specific. Subjunc/subread first builds a global indel table and then re-aligns all the reads, taking into account the indels in the table. A high error rate in the input data could result in the creation of a large indel table, and the realignment step will then take much longer.
In fact, on the same dataset the job had not completed after a week of running, so I killed it. It's also not clear to me what exactly counts as a complex indel.
Subjunc/subread has a hard-coded limit on the number of threads used by the program for mapping, and this limit is 40 threads. We can change this limit, but I do not feel the running time will be improved much by using more threads than that because of the I/O cost from each thread.
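To make the options discussed above concrete, here is a minimal Python sketch that only assembles the subjunc command lines for the two runs compared below; it does not execute anything. The helper `build_subjunc_cmd`, the index name `my_index`, and the read and output file names are hypothetical placeholders of mine, not taken from the actual run. The thread cap of 40 simply mirrors the limit Wei mentioned.

```python
import shlex

# Mapping thread limit quoted in the post (hard-coded in subjunc/subread).
MAX_THREADS = 40


def build_subjunc_cmd(index, reads, output, max_indel=None, threads=40,
                      all_junctions=True, complex_indels=False):
    """Assemble a subjunc command line as a list of arguments.

    index, reads and output are placeholders; adjust them to your data.
    """
    cmd = ["subjunc", "-i", index, "-r", reads, "-o", output,
           # Asking for more than 40 threads is pointless given the limit.
           "-T", str(min(threads, MAX_THREADS))]
    if max_indel is not None:
        # Values above 16 trigger read assembly for long indels (slower).
        cmd += ["-I", str(max_indel)]
    if all_junctions:
        cmd.append("--allJunctions")
    if complex_indels:
        # Can be very slow on data with a high error rate.
        cmd.append("--complexIndels")
    return cmd


default_run = build_subjunc_cmd("my_index", "reads.fastq.gz", "default.bam")
long_indel_run = build_subjunc_cmd("my_index", "reads.fastq.gz",
                                   "indel17.bam", max_indel=17)

print(shlex.join(default_run))
print(shlex.join(long_indel_run))
```

The resulting list can be passed to `subprocess.run` if you want to launch the job from Python; I kept it as a plain printout so the two configurations are easy to compare.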
Default parameters with --allJunctions
|| Junctions : 174694 ||
|| Fusions : 87827 ||
|| Indels : 184357 ||
|| Running time : 163.7 minutes ||
Indels up to 17bp (-I17) with --allJunctions
|| Junctions : 173885 ||
|| Fusions : 87393 ||
|| Indels : 189873 ||
|| Running time : 588.3 minutes ||
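For anyone who wants to check the trade-off, the two summaries above can be compared with a few lines of Python. This is just a sketch: the function names `parse_summary` and `percent_change` are my own, and the regular expression assumes the `|| key : value ||` layout exactly as pasted here.

```python
import re

DEFAULT_RUN = """\
|| Junctions : 174694 ||
|| Fusions : 87827 ||
|| Indels : 184357 ||
|| Running time : 163.7 minutes ||
"""

INDEL17_RUN = """\
|| Junctions : 173885 ||
|| Fusions : 87393 ||
|| Indels : 189873 ||
|| Running time : 588.3 minutes ||
"""

# Matches "|| <key> : <number>" and ignores anything after the number.
SUMMARY_LINE = re.compile(r"\|\|\s*(.+?)\s*:\s*([\d.]+)")


def parse_summary(text):
    """Return {key: numeric value} for every summary line in text."""
    values = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.search(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


default = parse_summary(DEFAULT_RUN)
indel17 = parse_summary(INDEL17_RUN)

changes = {key: percent_change(default[key], indel17[key]) for key in default}
slowdown = indel17["Running time"] / default["Running time"]

for key, change in changes.items():
    print(f"{key}: {change:+.2f}%")
print(f"Slowdown factor: {slowdown:.2f}x")
```

On these numbers the gain is roughly 3% more indels for about 3.6 times the running time, with junctions and fusions each down by about half a percent.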
modified 3.5 years ago
3.5 years ago by
Jarretinha ♦ 3.3k