I've aligned an RNA-Seq library using Tophat, and I'm confused about the output as there seems to be some disagreement that I can't figure out. The align_summary.txt that Tophat creates looks like this:
Left reads: Input : 49801387 Mapped : 46258301 (92.9% of input) of these: 1703360 ( 3.7%) have multiple alignments (103886 have >20) Right reads: Input : 49801387 Mapped : 45542088 (91.4% of input) of these: 1680180 ( 3.7%) have multiple alignments (103887 have >20) 92.2% overall read mapping rate. Aligned pairs: 44100265 of these: 1622622 ( 3.7%) have multiple alignments 261400 ( 0.6%) are discordant alignments 88.0% concordant pair alignment rate.
And the output of samtools flagstat on the accepted_hits.bam file produced by Tophat is:
$ samtools flagstat accepted_hits.bam 105064647 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 105064647 + 0 mapped (100.00%:-nan%) 105064647 + 0 paired in sequencing 52934350 + 0 read1 52130297 + 0 read2 89670194 + 0 properly paired (85.35%:-nan%) 101093928 + 0 with itself and mate mapped 3970719 + 0 singletons (3.78%:-nan%) 5036220 + 0 with mate mapped to a different chr 248094 + 0 with mate mapped to a different chr (mapQ>=5)
So at the bottom of align_summary it says that we have an 88% concordant pair alignment rate. Since I input 49,801,387 pairs, then 88% of that is 43,825,220.56. It says we have 44,100,265 aligned pairs. We can subtract the 261,400 that are discordant alignments and that gives us 43,838,865. Close, but not the same. How is that 88% calculated?
Furthermore, if we go to the samtools flagstat output for accepted_hits.bam file, it reports 89670194 reads properly paired, or if we divide by two 44,835,097. So we have three numbers for the number of aligned pairs, all of which are similar but not exactly the same. Does anyone have any insight on this?
A related question: When you do RNA-seq alignment and want to report your alignment rate, how do you get that number?