Question

Exon vs Transcript in featurecounts for RNA Seq

0

Entering edit mode

12 months ago

murphytho1401 • 0

I am counting reads with featurecounts after aligning RNA-seq data to Hg38 using STAR. When I run featurecounts with “-t exon” (i.e. using the exon flag in the 3rd column of the GTF file), I get generally poor results, with less than 50% being assigned and greater than 50% being unnassigned_NoFeatures. However, when I specify “-t transcript” (i.e. using the transcript flag instead of exon), I get significantly more assigned reads (maybe around 70%) and much fewer unassigned_NoFeatures.

I am essentially wondering if there’s anything wrong with using -t transcript instead of -t exon. Of course, I understand the theoretical biological difference between a transcript and an exon, but is it so awful to use the transcript rows in the GTF instead of the exon rows? In theory, doesn’t using the transcript flag tell you more about transcription than the exon flag does?

In addition, what could be causing such a huge difference? Using the transcript flag instead of the exon flag gets me double or more than double the number of reads assigned per sample (from ~27 million to ~56 million in one sample, for example. Does this simply mean I have a ton of intronic reads in my data? Or is something else driving this?

As some extra info, I already checked rRNA genes. They seem to be annotated as exons and transcripts anyway, so it’s not them. In case anyone was wondering, I have tried running it unstranded, forward, and reverse. Reverse gets the best results.

featurecounts rnaseq • 1.4k views

ADD COMMENT • link 12 months ago by murphytho1401 • 0

score 2 · Answer 1 · 2023-04-20

2

Entering edit mode

12 months ago

i.sudbery 19k

The difference is almost certainly down to intronic reads. Depending on how your RNA-seq library was generated, 50% of reads in exons is not particularly unusual - I normally reckon 35%-50% of reads in protein coding exons for total RNA-seq (ribosome depleted) and around 65% for polyA selected. Unspliced primary RNA transcripts might be much less abundant than their spliced countarparts, but they cover much more genome space. Having one read 100 bases for 1kb will give the same number of reads as having 1 read every 2 bases for a 200 base exon. Intronic sequences makes up a larger portion of the genome.

ADD COMMENT • link 12 months ago by i.sudbery 19k

0

Entering edit mode

Ah, yes. I should have mentioned that I’m pretty sure it was indeed total RNA-Seq, not poly-A selected. And I definitely see what you mean about intronic reads potentially making up a large portion of the reds despite not being particularly plentiful in number. I’m still shocked that my number of aligned reads doubles, but the logic makes sense.

So, in your opinion, then, I should continue on using -t exon to only include exonic reads in my analysis? Or would counting be transcripts and including intronic reads more relevant? I’m just not quite sure which will give me the best, biologically relevant information.

It feels relevant to clarify that downstream, I’ll be doing the standard per-gene quantification and differential expression analysis with DESeq2. I won’t be comparing expression of different genes within one sample. So a bloated count due to intronic reads may not be the worst thing in the world? I’m not sure.

And by the way, thank you very much for your answer!

ADD REPLY • link 12 months ago by murphytho1401 • 0

0

Entering edit mode

I guess what I'm saying is that doubling your aligned reads when using transcript rather than exon is entirely within the normal range of expectation. I'm not sure how much work has been done on including intron counts in bulk RNA-seq. I think a fiar amount has been done for single cell RNAseq, I don't remember what the conclusion was. One thing I'd worry about is that you sometimes get whole seperate genes (particularly ncRNAs like miRNAs, snoRNAs and scRNAs) nestled in the introns of larger genes. These genes are quite often highly expressed, and so might account for most of the reads assigned to the containing gene, even if they are not coming from this gene.