Question

Exons, Strands and Junctions

1

Entering edit mode

7.9 years ago

luisccleto ▴ 30

Hello!

I'm getting started in the field of bioinformatics and I'm developing a tool to analyze aligned reads and identify splice junctions and their type (within some options).

Essentially, I have a GTF file from which I load all exons and genes along with their genomic coordinates. Additionally, I have several BED files with junctions extracted from BAM files via samtools, awk and bamToBed.

While analysing results I had several numbers that seemed accurate but several others that did not make sense: A huge amount of junctions outside any known genes that I detected as completely novel junctions (didn't map to any known exon). There was also a large number that seemed to fit in both alternative 5 prime and alternative 3 prime simultaneously.

I decided to manually analyze some of these cases and I verified that there are indeed exons that would "fit" and make these junctions be classified as canonical. However, they are on the opposite strand of that of the junction. When matching exons or genes to positions I (currently) only consider those on the same chromosome AND strand. However, it seems the data would make more sense if I excluded the strand information altogether. However, if I'm not mistaken, in order to classify a junction as alt 3 prime or alt 5 prime I also need the strand information (and it seems there are many duplicate junctions on opposite strands)

As I am new in this field I am not certain of what would be the best way to proceed and I was hoping for some explanations for why this happens (or at least pointers in the right direction).

RNA-Seq junctions alternative splicing strands • 2.3k views

ADD COMMENT • link 7.9 years ago by luisccleto ▴ 30

1

Entering edit mode

Is the dataset you're basing this off of stranded? What is the source of the GTF you're using. In general, the annotations available are somewhat incomplete, so finding random things that appear to be spliced genes outside of annotated genes will happen (this are sometimes repeat elements, so have a look at a repeatmasker track).

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

1

Entering edit mode

Hey. Thanks for the quick reply!

The samples were provided by a research facility next to my campus and, from what I was able to uderstand by analysing the BAM files, they are (cancerous) thyroid tissue samples aligned with TopHat 1.4.1 and the strand information is available in the BAM files (hence extracting with bamToBed also outputting strand information for each read). I'm using hg19 from gencodegenes as my GTF reference as the reads were aligned with that same version.

I did expect to see some of those random things, specially for rare reads which were probably artifacts. What astonished me was the sheer amount of those events. I detected about as many novel junctions as I did canonical junctions (ignoring junctions with few supporting reads). This only changed once I included exons from both strands of a chromosome when searching.

ADD REPLY • link 7.9 years ago by luisccleto ▴ 30

1

Entering edit mode

I don't mean the orientation of the reads (what's available in the BAM files), I mean whether the libraries were prepared with a stranded protocol. If they weren't, then you can't infer strand information from the BAM file. Align with hg38 and use the most recent gencode annotation.

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks! I thought since the strand information was present in the BED file that there were no issues with that. But I analyzed the BAM files again as they contain the command that was used to run the aligner and it seems that a stranded protocol was not used (at least judging by the options used to run tophat)!

I'll try to redo the previous steps to include strand information but for now I think I'll just add an option to my script to allow both stranded and unstranded analysis.

ADD REPLY • link 7.9 years ago by luisccleto ▴ 30

0

Entering edit mode

You might just run RSeQC with the "infer experiment type" (or whatever it's called) module and see what it says.

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

0

Entering edit mode

On a related note, may I ask why you're doing this from scratch,please? GATK's Split'N'Trim does this for you.

ADD REPLY • link 7.9 years ago by Ram 43k

1

Entering edit mode

It's a university project for one of my classes. EDIT: further information - as I come from a programming background, the scripts themselves are not too challenging. The goal is to learn more about the intricacies of the biological part of bioinformatics and the underlying concepts but within a very restricted time frame through a learn-by-doing approach.