I used tophat to align paired-end reads from an rna-seq experiment and I obtained an accepted_hits.bam alignment file.
Using the accepted_hits.bam I'd like to count the number of properly aligned paired-end reads (i.e. both ends are aligned and the alignment make sense - something like that, not sure on the best definition here).
How should I count the number of properly-aligned paired-end fragments? I'd like to have the count in terms of fragments instead of reads (i.e., each fragment should be counted once if both paired end reads aligned properly).
Suggestions are welcome, I am new to processing paired-end data so thought asking here.
I just tried this:
Sample_5> samtools flagstat accepted_hits.bam 46656617 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 46656617 + 0 mapped (100.00%:nan%) 46656617 + 0 paired in sequencing 24411125 + 0 read1 22245492 + 0 read2 19907436 + 0 properly paired (42.67%:nan%) 36273430 + 0 with itself and mate mapped 10383187 + 0 singletons (22.25%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
there seem to be some issue on how tophat reports the alignments but not sure here: http://seqanswers.com/forums/showthread.php?t=8186.
Also, let me see if I understand this output correctly: seems there are 46 MM reads (that is counting reads from end1 + reads from end2?)
The flag field in the BAM file will contain the information you want. It is worth your time to read the SAM specifications: http://samtools.sourceforge.net/SAM1.pdf. Look @ 2. FLAG: bitwise FLAG. Each bit is explained in the following table:
thanks! that is assuming tophat correctly uses such flag, i am looking more for experience here, seems that's not always the case... http://vallandingham.me/RNA_seq_differential_expression.html here they mention explicitly the following: According to an unanswered SEQAnswers question , TopHat might not be setting this flag correctly or perhaps this flag is used differently in TopHat...