GENCODE versus Ensembl gene annotations
5.0 years ago
igor 12k

What is the difference between GENCODE and Ensembl annotation? That's actually the first question in the GENCODE FAQs:

The GENCODE annotation is made by merging the Havana manual gene annotation and the Ensembl automated gene annotation. ...In practical terms, the GENCODE annotation is identical to the Ensembl annotation.

I am looking at the mouse data for GENCODE M15 compared to Ensembl 90, which should be comparable according to both sources. The total number of transcripts is 131,100 vs 131,195, so that difference is negligible. However, some subsets are very different. The number of protein-coding genes is 21,950 vs 22,598, which is a little more noticeable. The number of long non-coding RNA genes is 11,975 vs 8,980, a drop of about 25%. Thus, it seems the annotations are not really identical. Are those differences real, or are the two sources just counting the gene biotypes differently?

ensembl gencode gtf • 4.4k views
5.0 years ago

The GENCODE statistics webpage covers the annotation on the reference chromosomes only, which has 131,100 transcripts.

The Ensembl statistics include all primary assembly regions, which explains the higher number of transcripts (131,195).
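To check the reference-chromosome point yourself, a minimal sketch along these lines should work. It counts `transcript` records in a GTF and splits them by whether the sequence name is a reference chromosome. The function name and the chromosome set are my own assumptions for illustration (GENCODE-style `chr` names for mouse); Ensembl GTFs use names like `1`, `X` and `MT`, so the set would need adjusting there.

```python
from collections import Counter

# Assumption: GENCODE-style names for the mouse reference chromosomes.
# Ensembl GTFs use "1", "X", "MT" instead, so adjust the set for those files.
REFERENCE_CHROMS = {f"chr{i}" for i in range(1, 20)} | {"chrX", "chrY", "chrM"}


def count_transcripts_by_region(lines, reference_chroms=REFERENCE_CHROMS):
    """Count 'transcript' records on reference chromosomes vs. everything else."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        key = "reference" if fields[0] in reference_chroms else "other"
        counts[key] += 1
    return counts
```

Any iterable of lines works as input, for example a handle from `gzip.open(path, "rt")` for a compressed GTF.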

Furthermore, the grouping of the gene biotypes on the statistics webpages differs between GENCODE and Ensembl:

GENCODE includes only the 21,950 genes with a "protein_coding" biotype under the "Protein-coding genes" category on the webpage. Ensembl's "Coding genes" category reports all genes that contain an ORF, which adds, for example, the IG/TR genes.

In the case of the long non-coding RNA genes, the difference is due to GENCODE including the TEC (To be Experimentally Confirmed) genes in that category (about 3,000, which matches 11,975 - 8,980 = 2,995).
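If you want to compare the groupings directly rather than rely on the statistics pages, a rough sketch like the one below tallies gene biotypes from either GTF. It relies on the attribute being called `gene_type` in GENCODE files and `gene_biotype` in Ensembl files. The function names and the IG/TR matching rule are assumptions of mine, not the exact logic either site uses for its categories.

```python
import re
from collections import Counter

# GENCODE GTFs name the attribute gene_type; Ensembl GTFs use gene_biotype.
BIOTYPE_RE = re.compile(r'\bgene_(?:type|biotype) "([^"]+)"')


def count_gene_biotypes(lines):
    """Tally the biotype of every 'gene' record in an iterable of GTF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = BIOTYPE_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def is_ig_tr_gene(biotype):
    # Assumed rule: IG_*_gene and TR_*_gene biotypes, excluding pseudogenes.
    return biotype.startswith(("IG_", "TR_")) and biotype.endswith("_gene")


def summarize(counts):
    """Contrast a strict protein_coding count with one that adds IG/TR genes."""
    strict = counts["protein_coding"]
    ig_tr = sum(n for biotype, n in counts.items() if is_ig_tr_gene(biotype))
    return {
        "protein_coding_only": strict,
        "protein_coding_plus_IG_TR": strict + ig_tr,
        "TEC": counts["TEC"],
    }
```

Running `summarize(count_gene_biotypes(handle))` on both files should show whether the protein-coding gap is explained by the IG/TR genes and whether the TEC count accounts for the lncRNA gap.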

