Discrepancy in gene counts between GENCODE 23 and Ensembl 81/82?
8.5 years ago

I see that the GENCODE 23 general statistics page (http://www.gencodegenes.org/stats/current.html) lists 19797 protein-coding genes in human, whereas Ensembl 81 returns 22017 unique protein-coding genes when I filter on Gene type (biotype) "protein_coding" and "GENCODE basic annotation" in Ensembl BioMart (http://www.ensembl.org/biomart/martview/034b08dbbcea12ff30614193d4d293a0). As I understand it, Ensembl imports the GENCODE gene set and annotations, so the gene counts should match. Could anyone please explain what I am missing and why the gene counts differ?

Thanks!

gene gencode biomart ensembl genome

We don't import Gencode, we make Gencode.


I'm not sure who deleted this post, but I've undeleted it.

8.5 years ago

22017 is the number of protein-coding genes you get if you include patches, scaffolds and alternate haplotype contigs. If you exclude those, you'll get 19779; the difference of 18 from the 19797 on the GENCODE statistics page is due to Gencode including annotations for genes on the pseudoautosomal region of chrY.

There are multiple "basic" annotations from Gencode: one contains only the 25 regular chromosomes and has 19797 protein-coding genes; the other has ~22000 protein-coding genes because it also contains alternate haplotypes, patches and scaffolds.


Thanks, that was useful! Could you also point me to the link where I can get this information? I am also quite confused to see a different number in Ensembl's "primary assembly" statistics, which list 20,296 coding genes: http://jul2015.archive.ensembl.org/Homo_sapiens/Info/Annotation.


I got these from the GTF files (for release 82) that were used to generate the numbers on the webpages you're seeing. The 20296 presumably comes from adding in the ~500 "read through" genes (there's still a difference of 4 that are presumably from something else). I'm just getting these numbers with grep and awk, btw.
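
For anyone who wants to reproduce this kind of count without grep and awk, here is a minimal Python sketch of the same idea. It is not the exact commands used above. It assumes a standard 9-column GTF, that the biotype is stored in a `gene_biotype` (Ensembl-style) or `gene_type` (GENCODE-style) attribute, and that the 25 regular chromosomes are named 1-22, X, Y and MT/M with an optional `chr` prefix. The function name and the toy `SAMPLE` records (including the contig name `ALT_CONTIG_1`) are made up for illustration.

```python
import re

# Assumed names of the 25 regular chromosomes, after stripping any "chr" prefix.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT", "M"}

# Matches quoted GTF attributes such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_protein_coding_genes(gtf_lines, primary_only=True):
    """Count unique gene_ids of protein-coding 'gene' records in GTF lines."""
    gene_ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level records are counted
        seqname = fields[0]
        if seqname.startswith("chr"):
            seqname = seqname[3:]
        if primary_only and seqname not in PRIMARY:
            continue  # drop patches, scaffolds and alternate haplotypes
        attrs = dict(ATTR_RE.findall(fields[8]))
        biotype = attrs.get("gene_biotype") or attrs.get("gene_type")
        if biotype == "protein_coding" and "gene_id" in attrs:
            gene_ids.add(attrs["gene_id"])
    return len(gene_ids)


# Toy records, invented for illustration only (not real annotation data).
SAMPLE = [
    '#!comment line\n',
    '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
    '1\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
    'chrX\tsrc\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding";\n',
    'ALT_CONTIG_1\tsrc\tgene\t10\t90\t.\t+\t.\tgene_id "G3"; gene_biotype "protein_coding";\n',
    '1\tsrc\tgene\t500\t600\t.\t+\t.\tgene_id "G4"; gene_biotype "lincRNA";\n',
]

print(count_protein_coding_genes(SAMPLE, primary_only=True))
print(count_protein_coding_genes(SAMPLE, primary_only=False))
```

To run it on a real file, pass an open text handle instead of `SAMPLE`, for example one from `gzip.open(path, "rt")`. Because this counts unique `gene_id` values, how pseudoautosomal genes on chrY are counted depends on how the particular GTF labels them, so small differences like the 18 mentioned above can come from that alone.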


Cool, thanks a lot!
