Annotating RNA-Seq exon junctions
1
1
Entering edit mode
7.1 years ago

Hi,

This may be a simple question but as far as I can tell the current tools don't quite do this. We have been using target RNA-seq to test the Junction abundance of certain genes, sadly due to probe limitations we were only able to do the junctions, very little of the exons are read and this causes difficult for many annotating tools (or I'm not using them correctly).

Basically all I want to do is take my genome coordinates from my alignment (STAR-2pass) and then annotate them as junctions (not transcripts). i.e. 1-2 or 1-3 exon junctions etc. Are there any tools out there that people know of that will annotate all the unique reads to exons they are contained in?

If not I will have to write some script for this, which will take me some time, and my be useful (limited) if no tools exists currently.

RNA-Seq Exon-Junctions STAR • 3.3k views
1
Entering edit mode

One of the output-files of STAR is the *SJ.out.tab which contains the position and inter alia the number of uniquely mapping reads crossing the junction. You can easily re-format the table to a bed format.

0
Entering edit mode

I must not have been clear in my original description. I want to use the coordinates from the SJ.out.tab file and use the start coordinate to return (from a GTF file) that exon it is anchored (i.e exon2) and then do the same for the stop coordinate. Basically I want to change my SJ.out.file to have these extra columns:

StartExon StartGene StopExon StopGene


I have included that gene as a way to to a quick a very dirty check. There may be tweaks on this.

0
Entering edit mode

This is quite complex due to the fact that an exon cannot only be identified by one junction. Please take a look at e.g. ApoE detecting the last junction does not tell you which last exon is used or even the ratios.

You can try to run Cufflinks or StringTie on the reads and see if they can separate the transcripts by the junctions.

Alternatively, you need for identifying two exons three junctions:

 Junction1 -- Exon1_Start - Exon1_End -- Junction2 - Exon2_Start - Exon2_End -- Junction3


Thus, you need to check for each exon if it is flanked by junction reads. Afterwards, you can assemble the selected exons by the SJ.out.tab file.

0
Entering edit mode

Yes I am beginning to find this more and more complicated the further I get into this. I will create and graphic to help explain this in a bit more detail- We have used the TruSeq Targeted RNA sequencing (illumina) where we have selected the probes for each junctions of our genes of interest. This gives us only reads that span junctions, and very little of each exon. So low coverage, which makes it an issue for cufflinks. What I am going to have to do, is get from (likely the SAM file) the first base (chromosomal location) of the read and the last base before the split. Use this information against a GTF file to see which exon it maps to (become difficult with multiple transcripts) and then do the same for the second half of the read.

I was hoping, smarter people then I have already developed a tool for this, but because of the very specific nature of this it does not appear so. If you have any better ideas please do tell me, this method (target RNA seq) is all very new too me.

0
Entering edit mode
7.1 years ago
Martombo ★ 3.0k

I've got something in python that could help you on my github. You can use the method get_splice_sites, of a Bam object. It returns a dictionary with a string defining the junction and the number of counts in the bam file. At the moment this works on any splice site present in the data, it is not based on a gtf file. You should then parse the site to assign it to a gene, eventually. To create a Bam object you just have to supply the path to it and indicate the reads_orientation.

0
Entering edit mode

ps: if you run tophat this information is already in the output (junctions.bed)

0
Entering edit mode

I believe the junctions.bed is similar to the SJ.out.tab in which you get the coordinates for each junction site, but no information on the actual exonic (i.e. START:genex -exon2 STOP:genex-exon4) location of the start and stop of the read?

0
Entering edit mode

ok I might still have something useful. one thing is not completely clear to me: do you want to have the exon coordinates or the read ones (as you write in your last comment)? The former can be achieved from a gtf file with this function, but you would still need some parsing to join the information with the counts across the junction.