Tophat align_summary.txt and samtools flagstat accepted_hits.bam disagree
1
0
Entering edit mode
4.9 years ago
colin.kern ▴ 970

I've aligned an RNA-Seq library using Tophat, and I'm confused about the output as there seems to be some disagreement that I can't figure out. The align_summary.txt that Tophat creates looks like this:

Left reads:
Input     :  49801387
Mapped   :  46258301 (92.9% of input)
of these:   1703360 ( 3.7%) have multiple alignments (103886 have >20)
Input     :  49801387
Mapped   :  45542088 (91.4% of input)
of these:   1680180 ( 3.7%) have multiple alignments (103887 have >20)

Aligned pairs:  44100265
of these:   1622622 ( 3.7%) have multiple alignments
261400 ( 0.6%) are discordant alignments
88.0% concordant pair alignment rate.


And the output of samtools flagstat on the accepted_hits.bam file produced by Tophat is:

\$ samtools flagstat accepted_hits.bam
0 + 0 duplicates
105064647 + 0 mapped (100.00%:-nan%)
105064647 + 0 paired in sequencing
89670194 + 0 properly paired (85.35%:-nan%)
101093928 + 0 with itself and mate mapped
3970719 + 0 singletons (3.78%:-nan%)
5036220 + 0 with mate mapped to a different chr
248094 + 0 with mate mapped to a different chr (mapQ>=5)


So at the bottom of align_summary it says that we have an 88% concordant pair alignment rate. Since I input 49,801,387 pairs, then 88% of that is 43,825,220.56. It says we have 44,100,265 aligned pairs. We can subtract the 261,400 that are discordant alignments and that gives us 43,838,865. Close, but not the same. How is that 88% calculated?

Furthermore, if we go to the samtools flagstat output for accepted_hits.bam file, it reports 89670194 reads properly paired, or if we divide by two 44,835,097. So we have three numbers for the number of aligned pairs, all of which are similar but not exactly the same. Does anyone have any insight on this?

A related question: When you do RNA-seq alignment and want to report your alignment rate, how do you get that number?

RNA-Seq tophat alignment samtools • 2.4k views
4
Entering edit mode
4.9 years ago

(44100265 - 261400) / 49801387 = 88% when rounded. You're expecting too much precision there apparently.

Samtools is telling you the raw number of "properly paired" alignments in the file. If any given read has multiple entries then it'll get counted multiple times....which is the source of the discrepancy you're seeing.

If I needed to report a number, I'd report the 92.2% overall alignment rate. You can probably get something similar with samtools view -c -F 260 accepted_hits.bam`.