I have Illumina paired end (2*150bp) reads for 30 samples that I mapped to the reference genome using Tophat2. The sequencing providers gave me the following information:
- The fragments used for sequencing range from 200 to 400bp
- The average size is around 303 to 330bp. This includes the adapters.
- The paired end reads (2*150bp) include adapter sequences
I did the first round of mapping (after trimming the adapters: Trimmomatic) using --mate-inner-dist 100 --mate-std-dev 40. I got good concordant pair alignment rate ranging from 75 to 80%.
I now want to optimize the Tophat2 results. So I used RSeQC on the accepted_hits.bam files.
And got these results (for Sample1):
inner_distance.py -i Sample1-accepted_hits.bam -o Sample1-output -r genes.bed
Get exon regions from genes.bed
Load BAM file ... Total read pairs used 1000000
Name Mean Median sd
Sample1-output -98.6345633476147 -113 51.2257769893221
I get very similar values for all the samples.
- Is it fine to use "accepted_hits.bam" of my first tophat2 for running RSeQC?
- Mean values are negative values does this mean that the reads are overlapping? And should I use these mean and std dev values for my next tophat2 run? --mate-inner-dist -99 --mate-std-dev 51
I tried running tophat2 on few samples using the parameters generated by RSeQC and do not find any significant difference in "overall read mapping rate" and "concordant pair alignment rate". Though there was an increase in the "number of Aligned pairs"
So initially I had got 25721951 aligned pairs and then after running Tophat2 with the new parameters it increased to 25722849.
Thanks in advance!