Hi,
I have an RNA-seq data set with nuclear and total (whole cell) RNA under multiple conditions. I first aligned them with STAR and now I'm trying to quantify the presence of different transcript regions (e.g. introns, CDS, 3UTR etc.) in my reads. For this, I ran featureCounts multiple times on different GTF.feature types (see example below).
featureCounts(files = bam.list,annot.ext = "Araport11_GTF_genes_transposons.Jan2023.gtf", isGTFAnnotationFile = TRUE, GTF.featureType = "CDS", GTF.attrType = "gene_id", countMultiMappingReads = TRUE, fraction = TRUE, allowMultiOverlap = FALSE, isPairedEnd = TRUE, strandSpecific = 2, useMetaFeatures = TRUE, nthreads = 20) #do the same for GTF.featureType "intron", "three_prime_utr", "five_prime_utr, "exon"
And also ran featureCounts on the full gene to calculate relative abundances of each transcript region:
featureCounts(files = bam.list,annot.ext = "Araport11_GTF_genes_transposons.Jan2023.gtf", isGTFAnnotationFile = TRUE, GTF.featureType = "gene", GTF.attrType = "gene_id", countMultiMappingReads = TRUE, fraction = TRUE, allowMultiOverlap = FALSE, isPairedEnd = TRUE, strandSpecific = 2, useMetaFeatures = TRUE, nthreads = 20)
However, when I looked at the relative abundance of each region (e.g. counts per CDS / counts per gene), I noticed that for quite some genes something strange is happening. The gene GTF feature should contain all counts from the other transcript regions, but for some genes the counts in e.g. the CDS are a lot higher than the counts in the gene. See image below:
I checked the count data for one of those genes, and indeed the counts in both the exon and gene were really low, but they were high for the 3UTR, 5UTR and CDS.
I cannot find an error in my GTF annotation file:
And on IGV, the reads seem to truly align to the gene:
Example of a BAM file that is counted for the CDS, but not for the gene:
Any idea what might be going on here??