Looking for documentation on the semantics of kgXref.geneSymbol column in UCSC's hg19
5.9 years ago
kynnjo ▴ 40

I am struck by the, let's say, lexical heterogeneity of the entries in the geneSymbol column of UCSC hg19's kgXref table.  Here's a sample[1]:

T
AR
C3
TRA@
HGC6.3
Z70701
unknown
Ig kappa
TIMELESS
5_8S_rRNA
OK/SW-cl.16
cytochrome b
Em:AC005003.4
Ig alpha 1-[alpha]2m
DTX2P1-UPK3BP1-PMS2P11
aromatase cytochrome P-450 (P-450AROM)
immunoglobulin epsilon chain constant...
T-cell receptor alpha chain variable ...
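
To make the heterogeneity concrete, here is a rough sketch that splits the sample above into entries that merely look like short symbols and everything else. The regular expression is my own crude assumption about what a "symbol-like" string is (an uppercase letter followed by uppercase letters, digits, or hyphens). It is not HGNC's or UCSC's rule, and the names `SYMBOL_LIKE` and `split_by_shape` are made up for illustration.

```python
import re

# The sample entries listed above, copied verbatim.
SAMPLE = [
    "T", "AR", "C3", "TRA@", "HGC6.3", "Z70701", "unknown", "Ig kappa",
    "TIMELESS", "5_8S_rRNA", "OK/SW-cl.16", "cytochrome b", "Em:AC005003.4",
    "Ig alpha 1-[alpha]2m", "DTX2P1-UPK3BP1-PMS2P11",
    "aromatase cytochrome P-450 (P-450AROM)",
    "immunoglobulin epsilon chain constant...",
    "T-cell receptor alpha chain variable ...",
]

# Assumption: a crude "looks like a symbol" shape, NOT an official naming rule.
SYMBOL_LIKE = re.compile(r"[A-Z][A-Z0-9-]*")


def split_by_shape(entries):
    """Partition entries into (symbol-like, other) using the heuristic above."""
    symbol_like = [e for e in entries if SYMBOL_LIKE.fullmatch(e)]
    other = [e for e in entries if not SYMBOL_LIKE.fullmatch(e)]
    return symbol_like, other


symbol_like, other = split_by_shape(SAMPLE)
print("symbol-like:", symbol_like)
print("other:", other)
print("longest entry length:", max(len(e) for e in SAMPLE))
```

Even this crude filter lets Z70701 through, which appears to be an accession number rather than a gene symbol, so the shape of an entry alone says little about where it came from.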

I would like to know more about the "semantics" of this table's geneSymbol column, but I am having a really hard time finding authoritative[2] answers to my questions.  (These questions include, among others, the following.  What is the provenance of these "gene symbols"?  Is UCSC the ultimate authority on them, or are they getting these symbols from some other authority?  Who/what ensures that distinct symbols always refer to distinct genes?  Etc.)

If I go to the UCSC Table Browser, select kgXref from the "table" dropdown, and then click on "describe table schema", the resulting page shows a lot of useful information, but it does not tell me anything about how this table was put together.  In particular, it tells me nothing relevant to questions like the ones mentioned earlier.

[1] The trailing ... in the last two entries of the list is, in fact, part of the values stored in the table.  Both of these geneSymbol entries are 40 characters long; they are the longest ones in the table.  FWIW, the type of the geneSymbol column is varchar(255).

[2] By "authoritative answers" I mean answers that come from a publication (preferably peer-reviewed) authored by those who produced the database.  It is not too difficult to come by educated guesses to answer at least some of these questions.  I probably could do a passable job myself, but this is not what I am after.

5.9 years ago

The authoritative source is HGNC, though what's officially supposed to be used and what is often used aren't always the same. For humans, everyone (that includes UCSC) just gets their annotations from Gencode, which presumably uses HGNC approved symbols.

Note however that there's a lot of messy history to these symbols, which is why you'll often find a large number of possible symbols for any given gene.

Thanks. Do you by any chance know if the UCSC database has some identifier/column that uniquely identifies human genes (in a strict 1-to-1 correspondence between genes and these identifiers)? The so-called "known gene ID" (aka kgID) cannot be it, because there are 82,960 distinct kgIDs in kgXref, which seems to me just too high. (Here again, it sure would be nice to have some authoritative documentation on the semantics of the kgID column.) In contrast, kgXref mentions only 28,514 distinct "geneSymbols", a number that seems to me more in line with the commonly cited estimates of the number of genes in the human genome. I sure hope, however, that the UCSC genome database has a more carefully controlled set of identifiers than these chaotic "geneSymbols" to uniquely identify what is probably the most important entity in their database.
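
For reference, counts like the ones above can be obtained with two short SQL statements. The sketch below runs them against an in-memory SQLite table holding a few invented rows, so the kgID values and the numbers it prints are toy data, not UCSC records. Only the column names kgID and geneSymbol come from kgXref; the helper `summarize` is hypothetical. I would expect the same statements to work on the real table, but I have not verified that here.

```python
import sqlite3

# Toy stand-in for kgXref with only two of its columns.
# These rows are invented for illustration; they are NOT real UCSC records.
TOY_ROWS = [
    ("toy0001.1", "AR"),
    ("toy0002.1", "AR"),
    ("toy0003.1", "TIMELESS"),
    ("toy0004.1", "C3"),
    ("toy0005.1", "C3"),
    ("toy0006.1", "C3"),
]

# How many distinct identifiers and distinct symbols does the table hold?
COUNT_SQL = """
SELECT COUNT(DISTINCT kgID), COUNT(DISTINCT geneSymbol)
FROM kgXref
"""

# Which symbols are attached to more than one kgID?
PER_SYMBOL_SQL = """
SELECT geneSymbol, COUNT(DISTINCT kgID) AS n_ids
FROM kgXref
GROUP BY geneSymbol
HAVING COUNT(DISTINCT kgID) > 1
ORDER BY n_ids DESC, geneSymbol
"""


def summarize(rows):
    """Load (kgID, geneSymbol) rows into SQLite and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT)")
    conn.executemany("INSERT INTO kgXref VALUES (?, ?)", rows)
    n_ids, n_symbols = conn.execute(COUNT_SQL).fetchone()
    shared = conn.execute(PER_SYMBOL_SQL).fetchall()
    conn.close()
    return n_ids, n_symbols, shared


n_ids, n_symbols, shared = summarize(TOY_ROWS)
print(n_ids, "distinct kgIDs;", n_symbols, "distinct geneSymbols")
print("symbols with several kgIDs:", shared)
```

If many symbols turn out to be attached to several kgIDs in the real table, that would at least account for the gap between the two counts, though it would not by itself say what a kgID is meant to identify.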

Many of us avoid UCSC since their annotations have historically been... problematic. You'll likely be better off with Ensembl/Gencode. Ensembl IDs should prove to be a superset of what's in HGNC.