I have a set of Illumina short read RNA-seq files which I'm hoping to use to analyse exon usage patterns within a set of specific genes.
The workflow is fastqc -> quality trimming and adapter removal -> alignment with STAR -> sorting and indexing to produce BAM file -> count alignments to annotated exons using featureCounts in subread.
However, I'm running into issues as the GENCODE annotation (GTF) file I used for alignment and feature counting (Release 33 (GRCh38.p13) Comprehensive gene annotation) has:
- instances where the same genomic region is annotated with different ENSEMBL exon IDs; and
- instances where two different exon annotations overlap one another for the same gene (presumably varying 5' and 3' splice sites??)
In any event, in order to do feature counting properly I need a non-redundant and non-overlapping set of exon annotations. This could possibly be the set of exons in the inferred metatranscript of the gene, or the set of exons where each exon is the largest possible set of overlapping genomic coordinates that are annotated as exons. However, I'm not quite sure how to go about producing something like this, and at which point to incorporate such an annotation into my analyses.
If I attempt to reinterpret the featureCounts table with new coordinates after it's already been produced, it's possible I'll miss some reads that haven't been counted.
I'd really appreciate some help in terms of how to go about doing this. Thanks!