I'm getting started in the field of bioinformatics and I'm developing a tool to analyze aligned reads and identify splice junctions and their type (within some options).
Essentially, I have a GTF file from which I load all exons and genes along with their genomic coordinates. Additionally, I have several BED files with junctions extracted from BAM files via samtools, awk and bamToBed.
While analysing results I had several numbers that seemed accurate but several others that did not make sense: A huge amount of junctions outside any known genes that I detected as completely novel junctions (didn't map to any known exon). There was also a large number that seemed to fit in both alternative 5 prime and alternative 3 prime simultaneously.
I decided to manually analyze some of these cases and I verified that there are indeed exons that would "fit" and make these junctions be classified as canonical. However, they are on the opposite strand of that of the junction. When matching exons or genes to positions I (currently) only consider those on the same chromosome AND strand. However, it seems the data would make more sense if I excluded the strand information altogether. However, if I'm not mistaken, in order to classify a junction as alt 3 prime or alt 5 prime I also need the strand information (and it seems there are many duplicate junctions on opposite strands)
As I am new in this field I am not certain of what would be the best way to proceed and I was hoping for some explanations for why this happens (or at least pointers in the right direction).