Extracting from tophat outputs reads pairs and splice-junctions with a single best match
0
7
Entering edit mode
7.0 years ago
trakhtenberg ▴ 160

In the tophat folder (generated by running tophat), there is a file called align_summary.txt. In the example below, it reports that out of 92923521 aligned pairs, 9.6% have multiple alignments, which means that there must be a way to generate a file containing only the uniquely aligned pairs (those that have only a single match as a pair). The accepted_hits.bam file, if I understand correctly, contains both, unique and multiple alignments pairs. Which command can I use to extract only the uniquely aligned pairs?

Also, if the junctions.bed file contains splice junctions which uniquely mapped based on alignments as well as multimappers, which command can I use to extract only the uniquely aligned? And since this is .bed rather than .bam format, is the pairing information utilized for determining the uniqueness of the splice junction read?

In the following 3 posts the code 255 (in both accepted_hits.bam and junctions.bed) is explained as either meaning that the mapping quality score is not available or that read has a single match in the reference: Tophat Rna-Seq Mapping Quality, https://biostar.usegalaxy.org/p/4077, Junctions.Bed File Produced From Tophat --- what is the final word on this? And if it means single match, does it refer to reads independently of pairing or also if as a pair the match is unique even if each read on its own could have more matches?

Another solution for accepted_hits.bam file is: samtools view -bq 1 file.bam > unique.bam, where unique reads = single read mapping at one best position, suggested in the following posts: A: How To Filter Mapped Reads With Samtools and http://seqanswers.com/forums/archive/index.php/t-6189.html --- here too, if as a pair the match is unique, even if each read on its own could have more matches, is it accepted as unique?

Alternatively, the following two posts suggest that using -g/--max-multihits 1 when running tophat will generate accepted_hits.bam file with reads that have only a single match in the genome, though I do not know whether this command will also include reads which have only 1 match as a pair but not each on its own? What Does The Tophat File Named 'Accepted_Hits.Bam' Include? and http://seqanswers.com/forums/showthread.php?t=8548

I assume that for all of the above, for a single match in the reference the reads/pairs tophat alignment run need to be set to disallow any mismatches, because if allowed, then more would get multiple matches and get excluded. On the other hand, if read/pair has 0 matches, it is preferable to allow at least 1 mismatch. How could such combined approach be implemented? The following post discusses the thresholds for alignment quality and I wonder if this could be used along with other approaches, so that reads pairs that have no matches would be allowed to have one mismatch? C: How to extract unique mapped results from Bowtie2 bam results? samtools view -b -q 10 foo.bam > foo.filtered.bam, where -q value represents the likelihood that an alignment is correct.

I do not have the expertise to evaluate all these approaches, and would appreciate advice. Thank you.

Input: 128979165

Mapped:  98314933 (76.2% of input)

of these:  11898655 (12.1%) have multiple alignments (9004 have >20)

Input: 128979165

Mapped:  95536410 (74.1% of input)

of these:  10769172 (11.3%) have multiple alignments (2289 have >20)

75.1% overall read alignment rate.

Aligned pairs:  92923521

of these:   8913959 ( 9.6%) have multiple alignments

and:   1899417 ( 2.0%) are discordant alignments

70.6% concordant pair alignment rate.

RNA-Seq TOPHAT • 8.8k views
1
Entering edit mode

Regarding the BAM file, the newer versions of tophat2 will give a score of 50 for alignments that don't multimap (this used to be 255). Yes, 255 is defined in the specification as "unavailable", but what the specification says and people do aren't always the same. So, just samtools view -bq 4 accepted_hits.bam > filtered.bam to filter things. Note that -q 1 won't work, since tophat2 will give MAPQ scores of between 0 and 3 to multimappers (depending on how many times they map).

I don't know about the BED files, so I'm just making this a comment to not dissuade others from answering.

BTW, you might switch to STAR. The author is pretty quick with responses over on seqanswers.

1
Entering edit mode

So,

samtools view -bq 4 accepted_hits.bam > filtered.bam


will select only unique mapping reads from the tophat2 accepted_hits.bam.

While

samtools view -bq 1


works for bwa file to select unique mapping reads.

1
Entering edit mode

Yup!

0
Entering edit mode

Thanks!

How do you filter discordant reads from filtered.bam. For eg: in the align_summary.txt provided by trakhtenberg, there are about 2% of discordant reads. Tophat with options --no-discordant always fails, probably need to do through samtools. Any suggestions. Thanks!

0
Entering edit mode

You want bit 2 to be set.

0
Entering edit mode

"You want bit 2 to be set". You mean during tophat or samtools ? Could you please show an example command.

thanks!

0
Entering edit mode

With samtools, so samtools view -bq 4 -f 2 ....

0
Entering edit mode

thank you. I also found on this post http://seqanswers.com/forums/newreply.php?do=newreply&p=108623 that the reads from junctions.bed are actually included in the accepted_hits.bam (tagged by 'N' characters in the CIGAR string, 6th column), so I only need to filter the BAM file.

Although not critical to this discussion, I have another question which may influence the identification of uniquely aligned reads, and I posted it here How to identify with Tophat maximum number of uniquely aligned reads by allowing mismatches only conditionally?

1
Entering edit mode

An N CIGAR operation means spliced, so that's correct. BTW, if you were to use STAR, you would find that it annotates its junctions with the number of uniquely aligned reads supporting them (it also gives you total reads). This is convenient for filtering novel junctions.

0
Entering edit mode

thank you for clarifying the meaning of N CIGAR. I will soon start practicing with STAR too.