How do I find up-to-date gene symbols for outdated names
4 weeks ago
DSmith • 0

I have a largish file (> 7 million entries) with ~18,000 gene symbols. Unfortunately, many of the symbols are outdated or refer to old mRNA sequences. Examples of the outdated names include AK094642, FLJ37183, KIAA2013, AY359883, BC029609, MGC40405, S63912, etc.

The modern symbols for many of these old names can be found one-by-one using Google (e.g. AK094642 is CFAP74), so the information is available. But the sheer number of old names rules out repairing them by hand. I have searched high and low for a translation table of old names to new, including HGNC, the UCSC Table Browser and the GENCODE GTF, but none seem to oblige. I am sure I must be missing something obvious. GeneCards seems very comprehensive, but I do not wish to pay their fee.

Could anyone please point me to a comprehensive translation table for old to new gene symbols?

Many thanks

names symbols gene • 324 views
4 weeks ago
GenoMax 142k

You can try EntrezDirect. You may not end up getting any useful gene names though since many of these records are likely to have been discontinued in light of new information.

$ esearch -db gene -query AK094642 | efetch -format acc

1. CFAP74
Official Symbol: CFAP74 and Name: cilia and flagella associated protein 74 [Homo sapiens (human)]
Other Aliases: C1orf222, CILD49, KIAA1751
Other Designations: cilia- and flagella-associated protein 74
Chromosome: 1; Location: 1p36.33
Annotation: Chromosome 1 NC_000001.11 (1921957..2003786, complement)
MIM: 620187
ID: 85452

2. LOC728690
uncharacterized LOC728690 [Homo sapiens (human)]
Chromosome: 1; Location: 1p36.33
Annotation: Chromosome 1 NC_000001.9 (1863709..1864446, complement)
This record was replaced with GeneID: 85452
ID: 728690

3. LOC348525
hypothetical gene supported by AK094642 [Homo sapiens (human)]
Chromosome: 1
This record was discontinued.
ID: 348525

4. LOC339458
hypothetical gene supported by AK094642 [Homo sapiens (human)]
Chromosome: 1; Location: 1p36.32
This record was discontinued.
ID: 339458
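
The listing above mixes one live record with replaced and discontinued ones, so it helps to label each record before using the symbols. A minimal sketch, assuming the text layout shown above (a record opens with "N. SYMBOL" and closes with an "ID:" line); summarize_gene_records is a hypothetical helper name, not part of EntrezDirect:

```shell
# Sketch: summarize Entrez Gene text records read from stdin.
# Prints one tab-separated line per record: GeneID, symbol, status.
# Status is "live" unless the record says it was discontinued or replaced.
# summarize_gene_records is a hypothetical helper, not an EDirect command.
summarize_gene_records() {
  awk '
    /^[0-9]+\. / { sym = $2; status = "live"; next }   # "1. CFAP74" opens a record
    /^This record was discontinued/ { status = "discontinued" }
    /^This record was replaced/     { status = "replaced" }
    /^ID: / { print $2 "\t" sym "\t" status }          # "ID:" closes a record
  '
}
```

Piping the esearch/efetch command above into summarize_gene_records and keeping only rows whose third column is "live" would leave CFAP74 (GeneID 85452) for this query.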

Thank you very much for this suggestion GenoMax, which was most helpful. I put your suggested query in a simple loop:

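for i in $(cat old_genes.txt); do
  echo "${i}"
  # printf, unlike echo -n, treats "\n" the same way in every shell
  printf '\n%s++' "${i}" >> ans.txt
  esearch -db gene -query "${i} AND human [ORGN]" | efilter -status alive | efetch -format acc | grep '^1\. ' >> ans.txt
done
# turn "old++1. NEW" into "old<TAB>NEW", and "old++" (no hit) into "old<TAB>"
sed 's/++1\. /\t/g' ans.txt | sed 's/++/\t/g' | awk 'NF' >> ans2.txt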

This loop fails for some well-known gene names. For example, it translates LDLR into TP53. But for strangely named "genes", e.g. AK094642, it seems to work well, finding modern replacements for roughly 70-80% of the old gene names.


Further examination indicates that my little "codelet" using EntrezDirect has a fairly high error rate. I have not examined these errors very carefully, but there appear to be unfortunate substitutions of well-known gene names by other well-known gene names. These incorrect substitutions may possibly originate from the NCBI Gene database that EntrezDirect queries. I suggest that careful examination of any name changes would be prudent.
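
One way to start that examination is to flag rows where a name that is already a current symbol was "translated" into a different one (as with LDLR into TP53). A rough sketch, assuming a two-column, tab-separated mapping like ans2.txt and a separate list of currently approved symbols, one per line; the helper name flag_suspect_mappings and the file approved_symbols.txt are hypothetical:

```shell
# Sketch: flag suspicious rows in a tab-separated mapping (old name, new symbol).
# A row is flagged when the old name is itself an approved symbol
# but was mapped to a different symbol (e.g. LDLR -> TP53).
# Usage (file names are assumptions):
#   flag_suspect_mappings approved_symbols.txt ans2.txt
flag_suspect_mappings() {
  approved=$1   # one approved symbol per line, from a source you trust
  mapping=$2    # old name <TAB> new symbol
  awk -F '\t' '
    NR == FNR { ok[$1] = 1; next }                 # first file: approved symbols
    ($1 in ok) && $2 != "" && $2 != $1 { print }   # second file: mapping rows
  ' "$approved" "$mapping"
}
```

This only catches the case where the input was already a valid symbol; it says nothing about whether a translated accession such as AK094642 landed on the right gene, so spot checks are still needed.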
