Question: Tophat align_summary.txt and samtools flagstat accepted_hits.bam disagree
0
gravatar for colin.kern
3.1 years ago by
colin.kern290
United States
colin.kern290 wrote:

I've aligned an RNA-Seq library using Tophat, and I'm confused about the output as there seems to be some disagreement that I can't figure out. The align_summary.txt that Tophat creates looks like this:

Left reads:
          Input     :  49801387
           Mapped   :  46258301 (92.9% of input)
            of these:   1703360 ( 3.7%) have multiple alignments (103886 have >20)
Right reads:
          Input     :  49801387
           Mapped   :  45542088 (91.4% of input)
            of these:   1680180 ( 3.7%) have multiple alignments (103887 have >20)
92.2% overall read mapping rate.

Aligned pairs:  44100265
     of these:   1622622 ( 3.7%) have multiple alignments
                  261400 ( 0.6%) are discordant alignments
88.0% concordant pair alignment rate.

And the output of samtools flagstat on the accepted_hits.bam file produced by Tophat is:

$ samtools flagstat accepted_hits.bam 
105064647 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
105064647 + 0 mapped (100.00%:-nan%)
105064647 + 0 paired in sequencing
52934350 + 0 read1
52130297 + 0 read2
89670194 + 0 properly paired (85.35%:-nan%)
101093928 + 0 with itself and mate mapped
3970719 + 0 singletons (3.78%:-nan%)
5036220 + 0 with mate mapped to a different chr
248094 + 0 with mate mapped to a different chr (mapQ>=5)

So at the bottom of align_summary it says that we have an 88% concordant pair alignment rate. Since I input 49,801,387 pairs, then 88% of that is 43,825,220.56. It says we have 44,100,265 aligned pairs. We can subtract the 261,400 that are discordant alignments and that gives us 43,838,865. Close, but not the same. How is that 88% calculated?

Furthermore, if we go to the samtools flagstat output for accepted_hits.bam file, it reports 89670194 reads properly paired, or if we divide by two 44,835,097. So we have three numbers for the number of aligned pairs, all of which are similar but not exactly the same. Does anyone have any insight on this?

A related question: When you do RNA-seq alignment and want to report your alignment rate, how do you get that number?

rna-seq samtools alignment tophat • 1.6k views
ADD COMMENTlink modified 3.1 years ago by Devon Ryan91k • written 3.1 years ago by colin.kern290
4
gravatar for Devon Ryan
3.1 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

(44100265 - 261400) / 49801387 = 88%` when rounded. You're expecting too much precision there apparently.

Samtools is telling you the raw number of "properly paired" alignments in the file. If any given read has multiple entries then it'll get counted multiple times....which is the source of the discrepancy you're seeing.

If I needed to report a number, I'd report the 92.2% overall alignment rate. You can probably get something similar with samtools view -c -F 260 accepted_hits.bam.

ADD COMMENTlink written 3.1 years ago by Devon Ryan91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1049 users visited in the last hour