I have a question I am struggling with, and I am sure there is a sensible explanation for it. I ran an RNA-seq analysis with STAR using the GRCh37 reference genome. I used the SJ.out file for some downstream analysis and have now detected some splice junctions that interest me. Using the start and end positions provided in the SJ.out file, I tried to find the upstream and downstream exons flanking my splice junctions. For this purpose I parsed the Homo_sapiens.GRCh37.75.gtf file.
To my surprise, some splice junctions don't seem to have an upstream or downstream exon flanking them. To find the flanking exons, I looked for upstream exons that end one base before the start of the splice junction and for downstream exons that begin one base after the end of the splice junction. For most (~85%) of the splice junctions I could find at least one upstream and one downstream exon, but the others either have no exon adjacent to them on either side or have an adjacent exon on only one side.
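To make the lookup concrete, the logic described above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function names `load_exon_boundaries` and `flanking_status` are hypothetical, and the sketch assumes that both the junction coordinates and the GTF coordinates are 1-based and inclusive, that the junction start and end are the first and last bases of the intron, and that chromosome names match between the two files. "Upstream" here simply means the lower-coordinate side, regardless of strand.

```python
def load_exon_boundaries(gtf_lines):
    """Collect exon start and end coordinates per chromosome from GTF lines."""
    exon_starts = {}  # chrom -> set of exon start positions
    exon_ends = {}    # chrom -> set of exon end positions
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep only exon features
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        exon_starts.setdefault(chrom, set()).add(start)
        exon_ends.setdefault(chrom, set()).add(end)
    return exon_starts, exon_ends


def flanking_status(chrom, sj_start, sj_end, exon_starts, exon_ends):
    """Return (has_upstream_exon, has_downstream_exon) for one junction.

    An upstream exon must end at sj_start - 1 and a downstream exon
    must begin at sj_end + 1.
    """
    has_upstream = (sj_start - 1) in exon_ends.get(chrom, set())
    has_downstream = (sj_end + 1) in exon_starts.get(chrom, set())
    return has_upstream, has_downstream


if __name__ == "__main__":
    # Toy GTF lines (made up for illustration, not taken from the real file).
    toy_gtf = [
        '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n',
        '1\ttoy\texon\t301\t400\t.\t+\t.\tgene_id "G1";\n',
    ]
    starts, ends = load_exon_boundaries(toy_gtf)
    print(flanking_status("1", 201, 300, starts, ends))
```

With the real annotation, the same functions could be fed the lines of Homo_sapiens.GRCh37.75.gtf by opening the file and passing the file handle as `gtf_lines`.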
Below I provide some examples of splice junctions that don't have flanking exons:
SJ1 (no upstream or downstream exon):
Chromosome: 3, Start: 52027879, End: 52028055
SJ2 (no upstream exon, but has a downstream exon):
Chromosome: 11, Start: 61204812, End: 61205096
SJ3 (no downstream exon, but has an upstream exon):
Chromosome: 2, Start: 44121769, End: 44122506
I cannot find a reasonable explanation for this, because the splice junction information for STAR is provided in the form of the GTF file that I am parsing. My understanding is that the start of the splice junction is the start of an intron and the end of the splice junction is the end of that intron, so in each case an exon should immediately precede the start and immediately follow the end. Interestingly, only SJ1 in the examples above has neither an upstream nor a downstream exon. The others are flanked by an exon on at least one side.
IMPORTANT NOTE: All splice junctions are reported as annotated in the SJ.out file, meaning there are no de novo splice junctions.
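For clarity, this is how the annotation status can be read from STAR's junction output. The sketch below assumes the standard tab-separated `SJ.out.tab` column layout (chromosome, first intron base, last intron base, strand code, intron motif, annotated flag, uniquely mapped reads, multi-mapped reads, maximum overhang); the `Junction` type and `parse_sj_line` function are hypothetical names, and the example line is made up.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int        # first base of the intron (1-based)
    end: int          # last base of the intron (1-based)
    strand: str       # "+", "-", or "." if undefined
    annotated: bool   # True if column 6 is 1
    unique_reads: int


# STAR encodes strand as 0 (undefined), 1 (+), 2 (-).
STRAND_CODES = {"0": ".", "1": "+", "2": "-"}


def parse_sj_line(line):
    """Parse one tab-separated line of an SJ.out.tab file."""
    f = line.rstrip("\n").split("\t")
    return Junction(
        chrom=f[0],
        start=int(f[1]),
        end=int(f[2]),
        strand=STRAND_CODES[f[3]],
        annotated=(f[5] == "1"),
        unique_reads=int(f[6]),
    )


if __name__ == "__main__":
    # Made-up example line, not taken from real output.
    example = "1\t201\t300\t1\t1\t1\t25\t3\t40\n"
    print(parse_sj_line(example))
```

Keeping the strand alongside the coordinates also makes it possible to check later whether "upstream" and "downstream" were assigned relative to the genome or to the transcript.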
I would greatly appreciate it if someone could point out a logical error or provide a biological explanation for this problem. If you need any additional information, I will be happy to provide it.