HGNC gene symbols
6.8 years ago
hirak.sarkar ▴ 20

I was looking into the mapping from Ensembl gene IDs to HGNC gene symbols, and a snapshot of it looks like this:

ENSG00000256024.1  CT476828.6
ENSG00000256252.1  CT476828.7
ENSG00000255638.1  CT476828.4
ENSG00000256521.1  CT476828.11
ENSG00000256490.1  CT476828.10
ENSG00000238720.1  CT476828.2
ENSG00000148828.5  CT476828.1

From http://www.gencodegenes.org/gencodeformat.html I understand that the Ensembl gene IDs have a version number appended to them, but what are the dot-appended values on the HGNC names? Are the names unique? If I remove everything after the dot, many Ensembl IDs map to the same HGNC gene symbol. Can anyone explain the naming protocol?


gene ensembl hgnc • 5.3k views

First, don't use the version number for Ensembl IDs. Second, CT476828.1 is not an HGNC gene name. HGNC gene names look like "polo-like kinase 1", with the associated gene symbol "PLK1". CT476828.1 looks more like a contig ID to me. Ensembl gene names and gene symbols are taken from HGNC, so once you have the gene ID you can get the gene name directly, either with BioMart or with the API:

# Assumes $gene_adaptor is a Bio::EnsEMBL gene adaptor and $EnsemblID has no version suffix
my $Ensgene = $gene_adaptor->fetch_by_stable_id($EnsemblID);
my $HGNC = $Ensgene->external_name();
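
If the IDs come from a GTF with versions attached, as in the question, the suffix can be dropped before the lookup. This is a minimal Python sketch, not part of the Ensembl API. The function name `strip_version` is made up for illustration, and it assumes the version is whatever follows the first dot.

```python
def strip_version(stable_id: str) -> str:
    """Return an Ensembl stable ID without its trailing '.N' version, if any."""
    # Assumption: everything after the first dot is the version suffix.
    return stable_id.split(".", 1)[0]


ids = ["ENSG00000256024.1", "ENSG00000148828.5", "ENSG00000238720"]
unversioned = [strip_version(i) for i in ids]
print(unversioned)
```

IDs that carry no version pass through unchanged, so the function is safe to apply to a mixed list.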

Thanks for clearing up the confusion. I still don't understand what the appended ".1", ".2", etc. signify. I thought they were gene names because I prepared the mapping from a GTF file. Here is a line from the GTF file, which gives this symbol as the gene name:

GL000228.1      ENSEMBL gene    92463   94085   .       +       .       gene_id "ENSG00000256024.1"; transcript_id "ENSG00000256024.1"; gene_type "pseudogene"; gene_status "NOVEL"; gene_name "CT476828.6"; transcript_type "pseudogene"; transcript_status "NOVEL"; transcript_name "CT476828.6"; level 3;

I have also edited the question to mention gene symbols.
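
For reference, a mapping like the one in the question can be pulled out of the attribute column of such a line. The following is a rough Python sketch, not a full GTF parser. The helper `parse_attributes` and its regular expression are illustrative assumptions, and a key that occurs more than once keeps only its last value.

```python
import re

# key "quoted value";  or  key bare_value;
ATTR_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_attributes(attr_field: str) -> dict:
    """Turn the ninth GTF column into a {key: value} dictionary."""
    attrs = {}
    for key, quoted, bare in ATTR_RE.findall(attr_field):
        attrs[key] = quoted if quoted else bare
    return attrs


line = ('GL000228.1\tENSEMBL\tgene\t92463\t94085\t.\t+\t.\t'
        'gene_id "ENSG00000256024.1"; transcript_id "ENSG00000256024.1"; '
        'gene_type "pseudogene"; gene_status "NOVEL"; gene_name "CT476828.6"; '
        'transcript_type "pseudogene"; transcript_status "NOVEL"; '
        'transcript_name "CT476828.6"; level 3;')

# The first eight columns contain no whitespace, so eight splits isolate column nine.
attrs = parse_attributes(line.split(None, 8)[8])
print(attrs["gene_id"], attrs["gene_name"])
```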


The .1 after ENSG00000256024 is the version number. I can't really think of a use for it, because if the gene changes significantly it gets a new ID. Also, many tools don't recognize it.
It looks like the CTxxxxx names correspond to novel non-coding transcripts, and that the genes were named after the corresponding transcripts. Gene symbols are generally associated with well-characterized genes, so novel genes usually get some sort of ID as their name.
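
The uniqueness question can be checked directly on the snapshot from the question. In this small Python sketch, `group_by_name` is a hypothetical helper that groups gene IDs by name, with or without the dot suffix.

```python
from collections import defaultdict

# Gene ID / gene name pairs copied from the snapshot in the question.
pairs = [
    ("ENSG00000256024.1", "CT476828.6"),
    ("ENSG00000256252.1", "CT476828.7"),
    ("ENSG00000255638.1", "CT476828.4"),
    ("ENSG00000256521.1", "CT476828.11"),
    ("ENSG00000256490.1", "CT476828.10"),
    ("ENSG00000238720.1", "CT476828.2"),
    ("ENSG00000148828.5", "CT476828.1"),
]


def group_by_name(pairs, strip_suffix=False):
    """Group gene IDs by gene name, optionally dropping the '.N' name suffix."""
    groups = defaultdict(list)
    for gene_id, name in pairs:
        key = name.rsplit(".", 1)[0] if strip_suffix else name
        groups[key].append(gene_id)
    return dict(groups)


full = group_by_name(pairs)
stripped = group_by_name(pairs, strip_suffix=True)
print(len(full), "distinct names with the suffix,", len(stripped), "without it")
```

On these seven rows, each name is unique as long as the suffix is kept. Dropping it collapses all seven genes onto the single key CT476828, so for these placeholder names the suffix is part of the identifier and should not be removed.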


I think my confusion comes from the distinction between novel genes and well known genes.

Thanks for the help!

