Question: tophat output files containing the reads which mapped uniquely as a pair
Asked 4.5 years ago by trakhtenberg (United States):

There is a file called align_summary.txt in the tophat folder (generated by running tophat) which says:

Left reads:
              Input: 128979165
              Mapped:  98314933 (76.2% of input)
             of these:  11898655 (12.1%) have multiple alignments (9004 have >20)
Right reads:
              Input: 128979165
              Mapped:  95536410 (74.1% of input)
             of these:  10769172 (11.3%) have multiple alignments (2289 have >20)
75.1% overall read alignment rate.

Aligned pairs:  92923521
     of these:   8913959 ( 9.6%) have multiple alignments
          and:   1899417 ( 2.0%) are discordant alignments
70.6% concordant pair alignment rate.

Does the last line, “70.6% concordant pair alignment rate”, mean that 70.6% of paired-end reads mapped uniquely (single match) as a pair? And are these 70.6% of read pairs what is included in accepted_hits.bam?

What about splice junctions which mapped uniquely to the transcriptome (rather than the genome): are they included in this 70.6%? In either case, does the junctions.bed file contain splice junctions which mapped uniquely to the transcriptome? Does the 70.6% refer to both reads uniquely mapped to the genome and splice junctions uniquely mapped to the transcriptome?

Would appreciate a clarification.

Thank you,
Ephraim Trakhtenberg

Tags: rna-seq, tophat

Answered 4.5 years ago by Devon Ryan (Freiburg, Germany):

A concordant alignment is defined as a pair on the same chromosome/contig with the proper orientation (typically pointing toward each other) with an appropriate distance between their extrema (due to size selection, though remember that the reasonableness of a distance is dependent on its transcript-space representation). Hopefully that was slightly clearer than mud.

So, this 70.6% number includes "unique" mappers and multi-mappers. These reads are among those included in the accepted_hits.bam file, though they aren't the only ones there: any alignment produced, concordant or not, is placed in accepted_hits.bam. This 70.6% is unrelated to splice junctions, novel or otherwise. The splice junctions are derived from looking at the alignments, but the junctions themselves wouldn't be stored in a BAM file. My guess is that you're trying to ask if tophat2 uses multimappers in finding splice junctions. I don't actually know the answer to that, though I would suspect not (it'd raise the false-positive rate).
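As a sanity check on what the percentages measure, the figures in the quoted align_summary.txt can be recomputed from the raw counts. The sketch below assumes (this is inferred from the numbers, not taken from tophat documentation) that the overall rate pools both mates and that the concordant rate is aligned pairs minus discordant pairs, divided by input pairs. The last figure, the share of input pairs with a single reported alignment, is not printed by tophat and is derived here only for illustration.

```python
# Counts copied from the align_summary.txt quoted in the question.
input_pairs = 128_979_165       # "Input" (identical for left and right reads)
left_mapped = 98_314_933
right_mapped = 95_536_410
aligned_pairs = 92_923_521
multi_pairs = 8_913_959         # aligned pairs with multiple alignments
discordant_pairs = 1_899_417

# Assumption: overall rate = mapped reads / input reads, both mates pooled.
overall_rate = (left_mapped + right_mapped) / (2 * input_pairs)

# Assumption: concordant rate = (aligned - discordant pairs) / input pairs.
# Multi-mapped pairs are NOT subtracted, so they are part of this figure.
concordant_rate = (aligned_pairs - discordant_pairs) / input_pairs

# Derived for illustration: pairs with a single reported alignment,
# whether concordant or discordant.
unique_pair_rate = (aligned_pairs - multi_pairs) / input_pairs

print(f"overall read alignment rate:  {overall_rate:.1%}")      # 75.1%
print(f"concordant pair rate:         {concordant_rate:.1%}")   # 70.6%
print(f"pairs with a single alignment: {unique_pair_rate:.1%}") # 65.1%
```

Both reported percentages are reproduced without removing the multi-mapped pairs, which is consistent with the 70.6% covering unique and multi-mapped pairs alike.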

Hopefully that clarifies things.
 


Thank you, this answers my question. To summarize: the last line in the align_summary.txt file is not a summary of all of the above but rather information specifically about concordant alignments, including both uniquely and multiply mapped read pairs. accepted_hits.bam does not contain exclusively uniquely mapped read pairs. It is unclear whether the junctions.bed file contains only splice junctions supported by uniquely mapped alignments or also those supported by multimappers. And it does not appear that align_summary.txt provides any statistics on the splice junctions derived from the alignments. This leads to other questions, which I have now posted here: Extracting from tophat outputs reads pairs and splice-junctions with a single best match

Reply written 4.5 years ago by trakhtenberg
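Since accepted_hits.bam mixes unique and multi-mapped alignments, one possible way to keep only records that are both properly paired and reported once is to test the SAM flag bits and the NH optional tag on each record. This is a minimal sketch, not a tophat feature: it assumes the records carry an NH:i tag (number of reported alignments, as defined in the SAM optional-fields specification), and the function name and the three example records are made up for illustration.

```python
def is_unique_proper_pair(sam_line: str) -> bool:
    """Return True for a primary, properly paired record reported exactly once."""
    if sam_line.startswith("@"):
        return False  # header line, not an alignment record
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    # SAM flag bits: 0x1 paired, 0x2 proper pair, 0x4 unmapped, 0x100 secondary.
    if not (flag & 0x1 and flag & 0x2) or flag & 0x4 or flag & 0x100:
        return False
    # Optional fields start at column 12; NH:i:<n> = number of reported alignments.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # no NH tag present: uniqueness unknown, so reject


# Hypothetical records (tab-separated SAM text), for illustration only.
examples = [
    "read1\t99\tchr1\t1000\t50\t50M\t=\t1200\t250\t*\t*\tNH:i:1",  # proper pair, unique
    "read2\t99\tchr1\t5000\t3\t50M\t=\t5200\t250\t*\t*\tNH:i:2",   # proper pair, multi-mapped
    "read3\t97\tchr1\t9000\t50\t50M\tchr2\t100\t0\t*\t*\tNH:i:1",  # unique but not a proper pair
]
for line in examples:
    print(line.split("\t")[0], is_unique_proper_pair(line))
```

The function works on text records such as those produced by `samtools view accepted_hits.bam`. It judges each record on its own, so requiring that both mates of a pair pass would need an extra step that groups records by read name.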