Question: Number of Transcripts in GRCh38
boyu9310 wrote:

I downloaded several canonical GRCh38 GTF files, but the number of transcripts is different in each one. I was hoping to find one with 198838 transcripts. Does anyone know where I could find a list showing the number of transcripts in each version of the canonical GTF file?

Edit: I have found that the GENCODE website has a summary of transcript numbers (GenCode_summary). Ensembl also has its own series of annotations. How can I find an annotation file with a specific number of transcripts, namely 198838?
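One way to compare the files directly is to count the transcripts in each downloaded GTF yourself. The sketch below is a minimal example, assuming the GTF has explicit "transcript" feature rows with a transcript_id attribute (true for GENCODE and recent Ensembl files, but not for every GTF); the function names and the file name in the comment are hypothetical.

```python
import gzip
import re

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def count_transcripts(lines):
    """Count unique transcript_id values on 'transcript' feature rows."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # skip malformed rows and gene/exon/CDS rows
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return len(ids)


# Example usage (hypothetical file name):
# with open_text("annotation.gtf.gz") as handle:
#     print(count_transcripts(handle))
```

Running this over each file gives numbers you can compare against the published summaries.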

Why are you looking for an annotation with exactly 198838 transcripts?

That's roughly the number of transcripts identified by GENCODE, but it includes upward of 40,000 pseudogenes and tens of thousands of transcripts that undergo nonsense-mediated decay.

The last time I ran an RNA-seq experiment with Kallisto, I used the GENCODE v24 FASTA as the reference. It had 199,170 transcripts, which covers all genes and known transcript isoforms. Here's a direct link to the listing (on my box dot com page), in case you're interested:
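To check the transcript count of a transcriptome FASTA like this one, counting header lines is enough. This is a minimal sketch, assuming one ">" header per transcript; the function name and the file name in the comment are hypothetical.

```python
import gzip


def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Example usage (hypothetical file name):
# with gzip.open("transcripts.fa.gz", "rt") as handle:
#     print(count_fasta_records(handle))
```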

In the Ensembl archives you can find old versions. For each one, the "More information and statistics" link for the human genome gives a detailed breakdown of the number of different features in that particular assembly. The July 2015 archive is one example. None of the assemblies I quickly checked had exactly 198838 transcripts (some came close, but none matched that number).

Thanks so much for the reply. I just found that there were some problems when processing with Tablemaker: the number of transcripts in the Tablemaker output is not consistent with the reference annotation.

Denise - Open Targets wrote:

GENCODE calculates its stats taking into account the reference chromosomes only (check their README_stats.txt for the details), whereas Ensembl provides the stats for the reference chromosomes plus alternate sequences (haplotypes and patches).
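The difference between the two ways of counting can be seen by splitting the transcript rows of a GTF by sequence name. The sketch below is only an illustration: it assumes the reference chromosomes are named 1-22, X, Y and M/MT (with or without a "chr" prefix) and treats every other sequence name as "other". That naming rule is my assumption, not something taken from README_stats.txt, and the function counts rows rather than unique IDs.

```python
import re
from collections import Counter

# Assumption: primary chromosomes are 1-22, X, Y and M/MT, optionally "chr"-prefixed.
PRIMARY = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def transcripts_by_region(lines):
    """Count 'transcript' rows on primary chromosomes versus all other sequences."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        key = "primary" if PRIMARY.match(fields[0]) else "other"
        counts[key] += 1
    return counts
```

If the "other" count is zero, the file only carries annotation on the reference chromosomes.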

If you go to the Ensembl annotation page for human, you will see that their latest annotation "also includes 261 alt loci scaffolds, mainly in the LRC/KIR complex on chromosome 19 (35 alternate sequence representations) and the MHC region on chromosome 6 (7 alternate sequence representations)".

Your GENCODE v24 is on GRCh38.p5 (the 5th patched version of GRCh38), and the Ensembl annotation on GRCh38.p5 is available in their release 84. Ensembl reports 199,184 transcripts; the higher number is due to the transcripts annotated on the patches.

If you download Homo_sapiens.GRCh38.84.gtf.gz from the Ensembl FTP site for release 84, you should get the same numbers, as that .gtf contains the annotation on the reference chromosomes only, without patches and haplotypes.

Thank you so much! I found the information I needed, and I traced the problem back to the annotation file used while running Cufflinks. It seems that a mistake during either Cufflinks or Tablemaker processing caused duplicated transcript names in a file. (Processing with StringTie does not have this issue.) After filtering out these duplicated transcripts, the numbers match the reference.
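For anyone hitting the same mismatch, a check like the following can show whether duplicated names explain it. It is a minimal sketch, assuming the transcript names have already been read into a list (the thread does not say which file or column held them); the function names are hypothetical.

```python
from collections import Counter


def find_duplicates(names):
    """Return a dict of names that occur more than once, with their counts."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def unique_in_order(names):
    """Drop repeated names, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for name in names:
        if name not in seen:
            seen.add(name)
            kept.append(name)
    return kept
```

Comparing len(unique_in_order(names)) with the transcript count of the reference annotation shows whether the duplicates account for the whole difference.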

Just looking for a small cross-check. The number of transcripts is 199184; what is the number of genes in this build? Is it 60675? So after read assignment with featureCounts/StringTie, should I expect to get counts for 60675 genes?

