tophat output files containing the reads which mapped uniquely as a pair
8.0 years ago
trakhtenberg ▴ 160

There is a file called align_summary.txt in the tophat folder (generated by running tophat) which says:

Left reads:
    Input:    128979165
    Mapped:    98314933 (76.2% of input)
    of these:  11898655 (12.1%) have multiple alignments (9004 have >20)
Right reads:
    Input:    128979165
    Mapped:    95536410 (74.1% of input)
    of these:  10769172 (11.3%) have multiple alignments (2289 have >20)
75.1% overall read alignment rate.

Aligned pairs:  92923521
    of these:    8913959 ( 9.6%) have multiple alignments
    and:         1899417 ( 2.0%) are discordant alignments
70.6% concordant pair alignment rate.


Does the last line, "70.6% concordant pair alignment rate", mean that 70.6% of paired-end reads mapped uniquely (single match) as a pair? And are these 70.6% of paired reads what is included in accepted_hits.bam?

What about splice junctions which mapped uniquely to the transcriptome (rather than the genome): are they included in this 70.6%? In either case, does the junctions.bed file contain splice junctions which mapped uniquely to the transcriptome? Does the 70.6% refer to both reads uniquely mapped to the genome and splice junctions uniquely mapped to the transcriptome?

Would appreciate a clarification.

Thank you,
Ephraim Trakhtenberg

TOPHAT RNA-Seq • 10k views
8.0 years ago

A concordant alignment is defined as a pair on the same chromosome/contig with the proper orientation (typically pointing toward each other) with an appropriate distance between their extrema (due to size selection, though remember that the reasonableness of a distance is dependent on its transcript-space representation). Hopefully that was slightly clearer than mud.

So, this 70.6% number includes "unique" mappers and multi-mappers. These reads are among those included in the accepted_hits.bam file, though they won't be all of them. Any alignment produced is placed in accepted_hits.bam. This 70.6% is unrelated to splice junctions, novel or otherwise. The splice junctions are derived from looking at the alignments, but the junctions themselves wouldn't be stored in a BAM file. My guess is that you're trying to ask if tophat2 uses multimappers in finding splice junctions. I don't actually know the answer to that, though I would suspect not (it'd raise the false-positive rate).
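As a quick sanity check, the two summary percentages can be reproduced from the counts in the question. The formulas below are an assumption inferred from the numbers lining up, not taken from the tophat source: the overall rate is treated as mapped reads over all input reads, and the concordant rate as (aligned pairs minus discordant pairs) over input pairs. Note that the multi-mapped pairs are not subtracted anywhere.

```python
# Counts copied from the align_summary.txt quoted in the question.
input_pairs = 128_979_165       # "Input" (same for left and right reads)
left_mapped = 98_314_933
right_mapped = 95_536_410
aligned_pairs = 92_923_521
discordant_pairs = 1_899_417

# Assumed formulas (inferred from the reported numbers, not from tophat's code).
overall_rate = 100 * (left_mapped + right_mapped) / (2 * input_pairs)
concordant_rate = 100 * (aligned_pairs - discordant_pairs) / input_pairs

print(f"{overall_rate:.1f}% overall read alignment rate")
print(f"{concordant_rate:.1f}% concordant pair alignment rate")
```

Under these assumptions the script gives 75.1% and 70.6%, matching the file, which is consistent with the concordant figure counting multi-mapped pairs as well as unique ones.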

Hopefully that clarifies things.


Thank you, this answers my question. To summarize: the last line in align_summary.txt is not a summary of everything above it, but refers specifically to concordant alignments, including both uniquely and multiply mapped read pairs. accepted_hits.bam does not contain exclusively uniquely mapped read pairs. It is unclear whether junctions.bed contains only splice junctions supported by uniquely mapped alignments, or ones supported by multimappers too. And align_summary.txt does not appear to provide any statistics on the splice junctions derived from the alignments. This leads to further questions, which I have now posted here: Extracting from tophat outputs reads pairs and splice-junctions with a single best match
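
For what it's worth, a rough sketch of how the uniquely mapped subset might be counted from SAM-formatted text (e.g. the output of `samtools view -h accepted_hits.bam`). It relies on the standard SAM flag bits 0x2 (properly paired) and 0x40 (first mate) and on the `NH:i:` tag (number of reported alignments). Treating flag 0x2 as equivalent to what align_summary.txt calls "concordant" is an assumption, the function name is made up for illustration, and only the first mate's NH value is checked.

```python
def count_unique_proper_pairs(sam_lines):
    """Count pairs whose first mate is flagged properly paired and has NH:i:1.

    Hypothetical helper: assumes tab-separated SAM records with an NH tag.
    """
    unique = 0
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        is_proper = bool(flag & 0x2)      # properly paired according to the aligner
        is_first = bool(flag & 0x40)      # count each pair once, via its first mate
        nh = None
        for tag in fields[11:]:           # optional fields start at column 12
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
        if is_proper and is_first and nh == 1:
            unique += 1
    return unique


# Toy records (made up for illustration, not real tophat output).
toy = [
    "@HD\tVN:1.0",
    "r1\t99\tchr1\t100\t50\t50M\t=\t300\t250\t*\t*\tNH:i:1",
    "r1\t147\tchr1\t300\t50\t50M\t=\t100\t-250\t*\t*\tNH:i:1",
    "r2\t99\tchr1\t500\t3\t50M\t=\t700\t250\t*\t*\tNH:i:3",
    "r3\t65\tchr1\t900\t50\t50M\tchr2\t100\t0\t*\t*\tNH:i:1",
]
print(count_unique_proper_pairs(toy))
```

On real data the lines could be read from `sys.stdin` with the `samtools view` output piped in; whether the resulting count matches any figure in align_summary.txt would need to be checked rather than assumed.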