Tophat - Expectancy Of Properly Paired Reads?
2
4
Entering edit mode
9.1 years ago
Johan ▴ 870

I started working with RNA sequence data for the first time. I have aligned a pair-end library using tophat to my reference using tophat with standard parameters. When I looked at the basic statistics for the results using samtools flagstat I noticed that ~75% of the reads were properly paired. In my ears that sounds somewhat low. Would any one with more experience in the matter care to comment on this?

Further more I tried to raise this number by specifying the -r/--mate-inner-dist <int> option. I have an estimated fragment length of 160 bp, and I therefore set it to -40 (160 - 100*2). However this landed me at properly paired percentage of ~57%. Am I doing something wrong with my calculation here?

tophat rna-seq • 5.8k views
1
Entering edit mode

75% properly paired is not low for RNA-seq in my opinion. As Ketil wrote: "Also, when the pair spans an intron, your aligner might decide they are too far apart to be tagged as "properly" paired" and this is in fact happens quite often.

0
Entering edit mode

It happens often for complex eukaryotes, but rarely for bacteria - so this depends a lot on the species under study. I think in most cases, an intron might be acceptable, at least for BWA alignment (details in my answer).

3
Entering edit mode
9.1 years ago

Your calculation of '--mate-inner-distance' seems correct, but assuming your reads are 2x100bp why was the fragment size selected during library construction so small? What does the shape of the distribution of sizes look like? If your distribution of fragment sizes is not what you expect, this could partially explain the lower rate of properly paired reads.

You could increase the '--mate-std-dev' to relax the stringency of expected fragment size. You could also pull out some of your reads that are not properly paired and map them by some other strategy (e.g. blat) to see what is going on.

You can use tools like Picard CollectRnaSeqMetrics and RNA-SeQC to identify possible quality issues with your data.

1
Entering edit mode
9.1 years ago
Ketil 4.1k

You don't say what genome this is, so one reason for low "properly paired" proportion could be a fragmented reference. Also, when the pair spans an intron, your aligner might decide they are too far apart to be tagged as "properly" paired.

I'm not familiar with the -r option, but this is often measured as insert (fragment) length, so I think you should use 160 here.

There are also other tools that will provide more elaborate statistics on insert lengths and pairing.

Edit: looking at some numbers from one project here (salmon louse, a crustacean), I find 91% mapped (according to samtools flagstat), and 89% properly paired. So I think 75% seems low, indicating but it depends on how many reads you map in the first place.

Samtools flagstat output:

15358792 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14047584 + 0 mapped (91.46%:-nan%)
15358792 + 0 paired in sequencing
13659830 + 0 properly paired (88.94%:-nan%)
13796301 + 0 with itself and mate mapped
251283 + 0 singletons (1.64%:-nan%)
58802 + 0 with mate mapped to a different chr
49248 + 0 with mate mapped to a different chr (mapQ>=5)


Output from bam stats:

## Input file: tmp/110901_SN865_A_L006_R1_GLW-45N.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                 6856689  89.29%    193.2   953.4   316.4 129128.3
outies                   10748   0.14%   1315.9  7899.1    26.3  1134.5
lefties                    886   0.01%  24816.6 61173.4     3.3     9.9
righties                   790   0.01%  32320.9 68283.9     3.1     9.8

unmapped:    1311203 (8.5%)
orphans:      251283 (1.6%)
split pairs:   29401 (0.4%)


From this, it doesn't look like introns make a big difference, bamstats will count any pair with correct orientation as an "innie", and there's only 50K that are recognized as innies that are not "proper pairs" according to the aligner. But note that "proper pair" is set by the alignment software, and different aligners may have different opinions of propriety.