Samtools Count Paired-End Reads
Entering edit mode
10.1 years ago
dfernan ▴ 730


I used tophat to align paired-end reads from an rna-seq experiment and I obtained an accepted_hits.bam alignment file.

Using the accepted_hits.bam I'd like to count the number of properly aligned paired-end reads (i.e. both ends are aligned and the alignment make sense - something like that, not sure on the best definition here).

How should I count the number of properly-aligned paired-end fragments? I'd like to have the count in terms of fragments instead of reads (i.e., each fragment should be counted once if both paired end reads aligned properly).

Suggestions are welcome, I am new to processing paired-end data so thought asking here.

I just tried this:

Sample_5> samtools flagstat accepted_hits.bam 
46656617 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
46656617 + 0 mapped (100.00%:nan%)
46656617 + 0 paired in sequencing
24411125 + 0 read1
22245492 + 0 read2
19907436 + 0 properly paired (42.67%:nan%)
36273430 + 0 with itself and mate mapped
10383187 + 0 singletons (22.25%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

there seem to be some issue on how tophat reports the alignments but not sure here:

Also, let me see if I understand this output correctly: seems there are 46 MM reads (that is counting reads from end1 + reads from end2?)


paired-end samtools tophat • 12k views
Entering edit mode

The flag field in the BAM file will contain the information you want. It is worth your time to read the SAM specifications: Look @ 2. FLAG: bitwise FLAG. Each bit is explained in the following table:

Entering edit mode

thanks! that is assuming tophat correctly uses such flag, i am looking more for experience here, seems that's not always the case... here they mention explicitly the following: According to an unanswered SEQAnswers question , TopHat might not be setting this flag correctly or perhaps this flag is used differently in TopHat...

Entering edit mode
10.1 years ago

Depending on your needs you should be using the following flags to extract the reads from bam files. Check samtools manual to learn how to extract the reads or there are many posts on biostar about the same.

99 - read paired, read mapped in proper pair, mate reverse strand, first in pair, 163 - read paired, read mapped in proper pair, mate reverse strand, second in pair , 83 and 147, (same as above i.e.properly paired reads but they map in the reverse)

I dont know how Tophat treats the reads that are aligned properly in terms of orientation but not in terms of the insert distance. It may happen that the original insert size between reads is 300 bp but these reads may part of different exons that tend to lie far away on the reference genome because there is a large intron between them. In this case, though the reads are paired properly or mapped properly but they may be flagged as improper by Tophat.

I wont considering removal of unpaired reads as the proper pairing can fail for may reasons: one of the pair maps in the repeat regions or sequencer messed up one of the reads in the pair etc. In case you need them, you should also include following flags:

73 - read paired, mate unmapped, first in pair, 137 - read paired, mate unmapped, second in pair, 89, 153 same as above but they map in the reverse.

Entering edit mode

thanks a lot, v helpful!


Login before adding your answer.

Traffic: 2068 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6