Question: Exons, Strands and Junctions
gravatar for luisccleto
3.4 years ago by
luisccleto30 wrote:


I'm getting started in the field of bioinformatics and I'm developing a tool to analyze aligned reads and identify splice junctions and their type (within some options).

Essentially, I have a GTF file from which I load all exons and genes along with their genomic coordinates. Additionally, I have several BED files with junctions extracted from BAM files via samtools, awk and bamToBed.

While analysing results I had several numbers that seemed accurate but several others that did not make sense: A huge amount of junctions outside any known genes that I detected as completely novel junctions (didn't map to any known exon). There was also a large number that seemed to fit in both alternative 5 prime and alternative 3 prime simultaneously.

I decided to manually analyze some of these cases and I verified that there are indeed exons that would "fit" and make these junctions be classified as canonical. However, they are on the opposite strand of that of the junction. When matching exons or genes to positions I (currently) only consider those on the same chromosome AND strand. However, it seems the data would make more sense if I excluded the strand information altogether. However, if I'm not mistaken, in order to classify a junction as alt 3 prime or alt 5 prime I also need the strand information (and it seems there are many duplicate junctions on opposite strands)

As I am new in this field I am not certain of what would be the best way to proceed and I was hoping for some explanations for why this happens (or at least pointers in the right direction).

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by luisccleto30

Is the dataset you're basing this off of stranded? What is the source of the GTF you're using. In general, the annotations available are somewhat incomplete, so finding random things that appear to be spliced genes outside of annotated genes will happen (this are sometimes repeat elements, so have a look at a repeatmasker track).

ADD REPLYlink written 3.4 years ago by Devon Ryan92k

Hey. Thanks for the quick reply!

The samples were provided by a research facility next to my campus and, from what I was able to uderstand by analysing the BAM files, they are (cancerous) thyroid tissue samples aligned with TopHat 1.4.1 and the strand information is available in the BAM files (hence extracting with bamToBed also outputting strand information for each read). I'm using hg19 from gencodegenes as my GTF reference as the reads were aligned with that same version.

I did expect to see some of those random things, specially for rare reads which were probably artifacts. What astonished me was the sheer amount of those events. I detected about as many novel junctions as I did canonical junctions (ignoring junctions with few supporting reads). This only changed once I included exons from both strands of a chromosome when searching.

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by luisccleto30

I don't mean the orientation of the reads (what's available in the BAM files), I mean whether the libraries were prepared with a stranded protocol. If they weren't, then you can't infer strand information from the BAM file. Align with hg38 and use the most recent gencode annotation.

ADD REPLYlink written 3.4 years ago by Devon Ryan92k

Thanks! I thought since the strand information was present in the BED file that there were no issues with that. But I analyzed the BAM files again as they contain the command that was used to run the aligner and it seems that a stranded protocol was not used (at least judging by the options used to run tophat)!

I'll try to redo the previous steps to include strand information but for now I think I'll just add an option to my script to allow both stranded and unstranded analysis.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by luisccleto30

You might just run RSeQC with the "infer experiment type" (or whatever it's called) module and see what it says.

ADD REPLYlink written 3.3 years ago by Devon Ryan92k

On a related note, may I ask why you're doing this from scratch,please? GATK's Split'N'Trim does this for you.

ADD REPLYlink written 3.4 years ago by RamRS24k

It's a university project for one of my classes. EDIT: further information - as I come from a programming background, the scripts themselves are not too challenging. The goal is to learn more about the intricacies of the biological part of bioinformatics and the underlying concepts but within a very restricted time frame through a learn-by-doing approach.

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by luisccleto30

Ah, I see. Good luck with the design, then!

ADD REPLYlink written 3.3 years ago by RamRS24k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 753 users visited in the last hour