I am trying to tune my tophat2 runs following the example of this blog. They suggest trying different values for the
To get an idea as to the numbers I should use, I ran tophat using :
tophat2 -p 20 -o out/ -a 8 --library-type fr-firststrand --no-coverage-search /path/to/Bowtie2Index/Mus_musculus.GRCm38 R1.fastq.gz R2.fastq.gz
Looking at the bam alignment file, I selected all the Read1 reads (SAM FLAG bit 0x40 set) which were properly aligned (bit 0x2 set) and were not secondary alignments (bit 0x100 _not_ set). The mean of abs(TLEN) is 3400 and the standard deviation of abs(TLEN) is 369034. Double checking the file, there indeed large values for TLEN (e.g. there are values in the 100,000s and millions).
Picking these as my values for the
mate-std-dev seems like a very poor decision. Clearly there is a tail of reads with very large template lengths that are skewing the distribution (which is _not_ Gaussian). So I have a few of questions:
- Is the template length (TLEN) from tophat computed from the genome or transcriptome?
- What would the best way to select the mate-inner-dist?
- Looking at the
run.logit appears that bowtie2 uses the default value for --maxins=500 (ie. it is not explicitly changed). Does this matter since I have values of TLEN > 500?