Different featureCounts data using GTF annotation file from two different sources.
1
1
Entering edit mode
8 weeks ago
Soumajit ▴ 30

Recently I was doing RNA-seq for a dataset of 30 samples (15 Fwd, 15 Rev). The pipeline was STAR alignment, then featureCounts for the gene level quantification.

Case 1: So, for that, I initially used an annotation file from the UCSC Genome browser website. The file name is hg38.refGene.gtf.gz (This one I usually use and I am sure a lot of people do the same).

Case 2: Now, I just wanted to try a different annotation. So, I ran the STAR alignment and featureCounts again for the same set of samples. This time, the gtf file was from NCBI RefSeq. The file name is GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz. It says "The annotation is NCBI Homo sapiens Updated Annotation Release 110 from 25 February 2022."

While I got nearly identical alignment values from STAR after both runs using different annotation files, the featureCounts values were not the same. Assignment rates are 2-3% higher in case 2 for almost all samples, also in the gene counts the fragment numbers are sometimes different. Surprisingly, in case 2, the count for some genes has values (sometimes quite high), while in case 1, it has 0 counts.

It is a bit of a confusion for me. So, can any of you experts explain why is this anomaly? Do you think this Case 2 annotation file is more recent and has more annotations, that is the sole reason? If yes, should I use this GTF file for the analysis of my RNA-seq? Or I am totally wrong and using the wrong GTF for a normal RNA-seq pipeline with the goal to study DGE?

Some details:

1. Samples are from Humans and are paired-end reverse-stranded reads

featureCounts RNA-seq STAR • 592 views
1
Entering edit mode

You need to make sure that your genomes and annotations match each other. Don't get a genome from one place and an annotation for somewhere else. But you used different inputs and got different outputs. Why is that a surprise?

3
Entering edit mode
8 weeks ago
Gordon Smyth ★ 5.4k

See https://bioinf.wehi.edu.au/Rsubread/annot/ for the annotation files (in SAF format) that I use for gene-level counting with Rsubread::featureCounts.

There's nothing surprising about getting slightly more reads from the NBCI full analysis set than from the UCSB RefSeq set. The former includes more transcripts and exons, many of which are computationally predicted and less reliable, so you will naturally be able to assign more reads to them. It's your judgement as to which to use. For my own gene-level RNA-seq analyses, I tend to prefer curated gene annotation rather than annotation with a large number of computationally predicted transcripts, for reasons similar to those explored by Chisanga et al (2022).

Reference

Chisanga, D., Liao, Y. & Shi, W. Impact of gene annotation choice on the quantification of RNA-seq data. BMC Bioinformatics 23, 107 (2022). https://doi.org/10.1186/s12859-022-04644-8

0
Entering edit mode

Thank you Gordon Smyth for the clear explanation and suggestion.