I have a largish file (> 7 million entries) with ~18,000 gene symbols. Unfortunately, many of the symbols are outdated or refer to old mRNA sequences. Examples of the outdated names include AK094642, FLJ37183, KIAA2013, AY359883, BC029609, MGC40405, S63912, etc.
The modern symbols for many of these old names can be found one by one using Google (e.g. AK094642 is CFAP74), so the information is available. But the sheer number of old names rules out repairing them by hand. I have searched high and low for a translation table of old names to new, including HGNC, the UCSC Table Browser and the GENCODE GTF, but none seem to oblige. I am sure I must be missing something obvious. GeneCards seems very comprehensive, but I do not wish to pay their fee.
Could anyone please point me to a comprehensive translation table for old to new gene symbols?
Many thanks
Thank you very much for this suggestion GenoMax, which was most helpful. I put your suggested query in a simple loop:
This loop fails for some well-known gene names. For example, it translates LDLR into TP53. But for strangely named "genes", e.g. AK094642, it seems to work well, finding modern replacements for roughly 70-80% of the old gene names.
Further examination indicates that my little "codelet" using EntrezDirect has a fairly high error rate. I have not examined these errors very carefully, but there appear to be unfortunate substitutions of well-known gene names by other well-known gene names. These incorrect substitutions may possibly originate from the EntrezDirect database. I suggest that careful examination of any name changes would be prudent.
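One way to limit this kind of damage might be to never translate a name that is already a current symbol, and to log every change for later inspection. Below is a minimal Python sketch of that idea, not the loop I actually ran. The function name `translate_symbols`, the `approved` set and the `toy_hits` dictionary are all hypothetical: `approved` stands in for whatever list of current symbols you trust, and `toy_hits` stands in for whatever the lookup query returns (it deliberately includes the bad LDLR to TP53 hit described above).

```python
from typing import Callable, Iterable, Optional


def translate_symbols(
    names: Iterable[str],
    lookup: Callable[[str], Optional[str]],
    approved: set[str],
) -> tuple[dict[str, str], list[tuple[str, str]]]:
    """Map old names to new ones, leaving current symbols untouched.

    Returns (mapping, changes): mapping covers every distinct input name,
    changes lists only the (old, new) pairs that were actually altered.
    """
    mapping: dict[str, str] = {}
    changes: list[tuple[str, str]] = []
    for name in dict.fromkeys(names):  # de-duplicate, keep first-seen order
        if name in approved:
            # Already a current symbol: never ask the lookup to "fix" it.
            mapping[name] = name
            continue
        new = lookup(name)
        if new is None or new not in approved:
            # No hit, or a hit that is not a known current symbol: keep as is.
            mapping[name] = name
            continue
        mapping[name] = new
        changes.append((name, new))
    return mapping, changes


# Toy data only (hypothetical stand-ins, not real query output).
approved = {"LDLR", "TP53", "CFAP74"}
toy_hits = {"AK094642": "CFAP74", "LDLR": "TP53"}

mapping, changes = translate_symbols(
    ["AK094642", "LDLR", "FLJ37183", "LDLR"], toy_hits.get, approved
)
for old, new in changes:
    print(f"{old}\t{new}")  # review list of every substitution made
```

With the toy data, LDLR is left alone because it is already in `approved`, FLJ37183 is left alone because the lookup has no hit for it, and only AK094642 is changed. The guard does not catch wrong hits for names that are genuinely outdated, so the printed change list still needs a manual check.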