Looking for documentation on the semantics of kgXref.geneSymbol column in UCSC's hg19
5.9 years ago
kynnjo ▴ 40

I am struck by the, let's say, lexical heterogeneity of the entries in the `geneSymbol` column of UCSC hg19's `kgXref` table.  Here's a sample[1]:

    Ig kappa
    cytochrome b
    Ig alpha 1-[alpha]2m
    aromatase cytochrome P-450 (P-450AROM)
    immunoglobulin epsilon chain constant...
    T-cell receptor alpha chain variable ...

I would like to know more about the "semantics" of this table's `geneSymbol` column, but I am having a really hard time finding authoritative[2] answers to my questions.  (These questions include, among others, the following.  What is the provenance of these "gene symbols"?  Is UCSC the ultimate authority on them, or are they getting these symbols from some other authority?  Who/what ensures that distinct symbols always refer to distinct genes?  Etc.)

If I go to


select  `kgXref` from the "table" dropdown, and then click on "describe table schema", the resulting page shows a lot of useful information, but it does not tell me anything about how this table was put together.  In particular, it tells me nothing relevant to questions like the ones mentioned earlier.

[1] The `...` at the end of the last two entries in the list belong, in fact, to the values stored in the table.  The length of both of these `geneSymbol` entries is 40; they are the longest ones in the table.  FWIW, the type of the `geneSymbol` column is `varchar(255)`.

[2] By "authoritative answers" I mean answers that come from a publication (preferably peer-reviewed) authored by those who produced the database.  It is not too difficult to come by educated guesses to answer at least some of these questions.  I probably could do a passable job myself, but this is not what I am after.


genome • 2.0k views
5.9 years ago

The authoritative source is HGNC, though what's officially supposed to be used and what is often used aren't always the same. For humans, everyone (that includes UCSC) just gets their annotations from Gencode, which presumably uses HGNC approved symbols.

Note however that there's a lot of messy history to these symbols, which is why you'll often find a large number of possible symbols for any given gene.


Thanks. Do you by any chance know if the UCSC database has some identifier/column that uniquely identifies human genes (in a strict 1-to-1 correspondence between genes and these identifiers)? The so-called "known gene ID" (aka kgID) cannot be it, because there are 82,960 distinct kgIDs in kgXref, which seems to me just too high. (Here again, it sure would be nice to have some authoritative documentation on the semantics of the kgID column.) In contrast, kgXref mentions only 28,514 distinct "geneSymbols", a number that seems to me more in line with the commonly cited estimates of the number of genes in the human genome. I sure hope, however, that the UCSC genome database has a more carefully controlled set of identifiers than these chaotic "geneSymbols" to uniquely identify what is probably the most important entity in their database.
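For anyone wanting to reproduce counts like these from a local dump of the table, here is a minimal sketch. It assumes a tab-separated export of `kgXref` with `kgID` in the first column and `geneSymbol` in the fifth; those column positions are assumptions to check against the schema page, and the sample rows below are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def symbol_to_kgids(lines, kgid_col=0, symbol_col=4):
    """Map each geneSymbol to the set of kgIDs that carry it.

    `lines` is any iterable of tab-separated text lines.
    The default column positions are assumptions, not documented facts.
    """
    mapping = defaultdict(set)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip rows too short to contain both columns.
        if len(row) <= max(kgid_col, symbol_col):
            continue
        mapping[row[symbol_col]].add(row[kgid_col])
    return mapping


# Made-up rows in the assumed layout (not real kgXref content).
SAMPLE = (
    "kg0001.1\tm1\t\t\tGENEA\n"
    "kg0002.1\tm2\t\t\tGENEA\n"
    "kg0003.1\tm3\t\t\tIg kappa\n"
)

mapping = symbol_to_kgids(io.StringIO(SAMPLE))
n_kgids = len(set().union(*mapping.values()))   # distinct kgIDs
n_symbols = len(mapping)                        # distinct geneSymbols
print(n_kgids, n_symbols)
```

On a real export, the same function can be fed an open file handle (for example one opened on a hypothetical `kgXref.txt`). Symbols that map to more than one kgID show up as sets with more than one element, which is one way to see how far the two columns are from a 1-to-1 correspondence.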


Many of us avoid UCSC since their annotations have historically been... problematic. You'll likely be better off with Ensembl/Gencode. Ensembl IDs should prove to be a superset of what's in HGNC.

