featureCounts (Ensembl-based GTF vs. in-built Entrez GTF): differences in count data
13 months ago

Hi everyone. Firstly, thank you in advance for any help you can give. I am new to bioinformatics and Biostars has been immensely helpful. I have human RNA-seq data that I am currently processing. I have finished the trimming and alignment (with STAR) stages and have just used featureCounts to count the reads in my data.

I have tried two different annotations with featureCounts. Both runs worked, but they gave different counts. First I used the hg38 GTF from Ensembl, and second I used the built-in hg38 GTF from the Rsubread package (Entrez Gene).

Both runs were successful, but when I compared corresponding genes between Ensembl and Entrez Gene, the counts were quite different. The total number of counted reads also differed: 36165730 with Ensembl versus 38752850 with Entrez Gene.
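A minimal sketch of how such totals can be computed and compared is below. It assumes the tab-delimited table written by the featureCounts command-line tool (an optional comment line starting with `#`, then a header whose first six columns are Geneid, Chr, Start, End, Strand, Length, followed by one count column per sample). The gene IDs, the sample column name and the two tiny tables are made up for illustration; only the two totals at the end come from this post.

```python
import csv


def total_counts(lines):
    """Sum the counts per sample column in a featureCounts-style table."""
    # Skip the leading '#' comment line(s), then read the tab-delimited rows.
    rows = csv.reader(
        (ln for ln in lines if not ln.startswith("#")), delimiter="\t"
    )
    header = next(rows)
    samples = header[6:]  # columns after Length hold the per-sample counts
    totals = dict.fromkeys(samples, 0)
    for row in rows:
        for name, value in zip(samples, row[6:]):
            totals[name] += int(value)
    return totals


HEADER = ["Geneid", "Chr", "Start", "End", "Strand", "Length", "sample1.bam"]

# Hypothetical two-gene tables standing in for the two real outputs.
ensembl_table = [
    "# hypothetical featureCounts comment line",
    "\t".join(HEADER),
    "\t".join(["ENSG_A", "1", "100", "900", "+", "801", "120"]),
    "\t".join(["ENSG_B", "1", "2000", "2600", "-", "601", "30"]),
]
entrez_table = [
    "# hypothetical featureCounts comment line",
    "\t".join(HEADER),
    "\t".join(["1001", "1", "100", "950", "+", "851", "125"]),
    "\t".join(["1002", "1", "2000", "2700", "-", "701", "40"]),
]

toy_ensembl_total = total_counts(ensembl_table)["sample1.bam"]
toy_entrez_total = total_counts(entrez_table)["sample1.bam"]

# The two totals reported in this post.
ensembl_total = 36165730
entrez_total = 38752850
difference = entrez_total - ensembl_total
percent_more = 100 * difference / ensembl_total

print(f"Toy tables: Ensembl={toy_ensembl_total}, Entrez={toy_entrez_total}")
print(f"Post totals: Entrez has {difference} more counts ({percent_more:.2f}% more)")
```

With real data, `total_counts(open(path))` would take the place of the toy lists, where `path` is a placeholder for each output file.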

Why would the total number of counts be higher with Entrez Gene? This seems strange considering that Ensembl is larger in scope. I understand that Ensembl and Entrez Gene do not completely align, but the differences seemed quite dramatic. Is this normal? And is it okay to use the Entrez values, considering that I aligned my data using an Ensembl GTF?

