Question: Extracting from tophat outputs reads pairs and splice-junctions with a single best match
7
gravatar for trakhtenberg
4.7 years ago by
trakhtenberg150
United States
trakhtenberg150 wrote:

In the tophat folder (generated by running tophat), there is a file called align_summary.txt. In the example below, it reports that out of 92923521 aligned pairs, 9.6% have multiple alignments, which means that there must be a way to generate a file containing only the uniquely aligned pairs (those that have only a single match as a pair). The accepted_hits.bam file, if I understand correctly, contains both, unique and multiple alignments pairs. Which command can I use to extract only the uniquely aligned pairs?

Also, if the junctions.bed file contains splice junctions which uniquely mapped based on alignments as well as multimappers, which command can I use to extract only the uniquely aligned? And since this is .bed rather than .bam format, is the pairing information utilized for determining the uniqueness of the splice junction read?

In the following 3 posts the code 255 (in both accepted_hits.bam and junctions.bed) is explained as either meaning that the mapping quality score is not available or that read has a single match in the reference: Tophat Rna-Seq Mapping Quality, https://biostar.usegalaxy.org/p/4077, Junctions.Bed File Produced From Tophat --- what is the final word on this? And if it means single match, does it refer to reads independently of pairing or also if as a pair the match is unique even if each read on its own could have more matches?

Another solution for accepted_hits.bam file is: samtools view -bq 1 file.bam > unique.bam, where unique reads = single read mapping at one best position, suggested in the following posts: A: How To Filter Mapped Reads With Samtools and http://seqanswers.com/forums/archive/index.php/t-6189.html --- here too, if as a pair the match is unique, even if each read on its own could have more matches, is it accepted as unique?

Alternatively, the following two posts suggest that using -g/--max-multihits 1 when running tophat will generate accepted_hits.bam file with reads that have only a single match in the genome, though I do not know whether this command will also include reads which have only 1 match as a pair but not each on its own? What Does The Tophat File Named 'Accepted_Hits.Bam' Include? and http://seqanswers.com/forums/showthread.php?t=8548

I assume that for all of the above, for a single match in the reference the reads/pairs tophat alignment run need to be set to disallow any mismatches, because if allowed, then more would get multiple matches and get excluded. On the other hand, if read/pair has 0 matches, it is preferable to allow at least 1 mismatch. How could such combined approach be implemented? The following post discusses the thresholds for alignment quality and I wonder if this could be used along with other approaches, so that reads pairs that have no matches would be allowed to have one mismatch? C: How to extract unique mapped results from Bowtie2 bam results? samtools view -b -q 10 foo.bam > foo.filtered.bam, where -q value represents the likelihood that an alignment is correct.

I do not have the expertise to evaluate all these approaches, and would appreciate advice. Thank you.

Left reads:

              Input: 128979165

              Mapped:  98314933 (76.2% of input)

             of these:  11898655 (12.1%) have multiple alignments (9004 have >20)

Right reads:

              Input: 128979165

              Mapped:  95536410 (74.1% of input)

             of these:  10769172 (11.3%) have multiple alignments (2289 have >20)

75.1% overall read alignment rate.

Aligned pairs:  92923521

     of these:   8913959 ( 9.6%) have multiple alignments

          and:   1899417 ( 2.0%) are discordant alignments

70.6% concordant pair alignment rate.

rna-seq tophat • 6.7k views
ADD COMMENTlink modified 4.7 years ago by Biostar ♦♦ 20 • written 4.7 years ago by trakhtenberg150
1

Regarding the BAM file, the newer versions of tophat2 will give a score of 50 for alignments that don't multimap (this used to be 255). Yes, 255 is defined in the specification as "unavailable", but what the specification says and people do aren't always the same. So, just samtools view -bq 4 accepted_hits.bam > filtered.bam to filter things. Note that -q 1 won't work, since tophat2 will give MAPQ scores of between 0 and 3 to multimappers (depending on how many times they map).

I don't know about the BED files, so I'm just making this a comment to not dissuade others from answering.

BTW, you might switch to STAR. The author is pretty quick with responses over on seqanswers.

ADD REPLYlink written 4.7 years ago by Devon Ryan89k
1

So, "samtools view -bq 4 accepted_hits.bam > filtered.bam" will select only unique mapping reads from the tophat2 accepted_hits.bam.

While samtools view -bq 1 works for bwa file to select unique mapping reads.

ADD REPLYlink written 3.5 years ago by Chirag Nepal2.2k
1

Yup!                      

ADD REPLYlink written 3.5 years ago by Devon Ryan89k

Thanks !

How do you filter discordant reads from filtered.bam. For eg: in the align_summary.txt provided by trakhtenberg, there are about 2% of discordant reads. Tophat with options --no-discordant always fails, probably need to do through samtools. Any suggestions.  Thanks !

ADD REPLYlink written 3.5 years ago by Chirag Nepal2.2k

You want bit 2 to be set.

ADD REPLYlink written 3.5 years ago by Devon Ryan89k

"You want bit 2 to be set." You mean during tophat or samtools ? Could you please show an example command. 

thanks !

ADD REPLYlink written 3.5 years ago by Chirag Nepal2.2k

With samtools, so samtools view -bq 4 -f 2 ....

ADD REPLYlink written 3.4 years ago by Devon Ryan89k

thank you. I also found on this post http://seqanswers.com/forums/newreply.php?do=newreply&p=108623 that the reads from junctions.bed are actually included in the accepted_hits.bam (tagged by 'N' characters in the CIGAR string, 6th column), so I only need to filter the BAM file.

Although not critical to this discussion, I have another question which may influence the identification of uniquely aligned reads, and I posted it here How to identify with Tophat maximum number of uniquely aligned reads by allowing mismatches only conditionally?

ADD REPLYlink written 4.6 years ago by trakhtenberg150
1

An N CIGAR operation means spliced, so that's correct. BTW, if you were to use STAR, you would find that it annotates its junctions with the number of uniquely aligned reads supporting them (it also gives you total reads). This is convenient for filtering novel junctions.

ADD REPLYlink written 4.6 years ago by Devon Ryan89k

thank you for clarifying the meaning of N CIGAR. I will soon start practicing with STAR too.

ADD REPLYlink written 4.6 years ago by trakhtenberg150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1713 users visited in the last hour