Question: Finding Gene Symbol Synonyms
Mike Dewar (Columbia University, NYC, USA) wrote 8.9 years ago:

Some HGNC Gene Symbols have synonyms that are more familiar to biologists of particular breeds. For example, "SELL" means little to an immunologist, whereas SELL's alias "CD62L" means rather a lot. Showing the biologist a list of gene names and asking "do any of these ring bells?" seems to be an important part of the process of selecting important genes (or, rather, of validating the selection), and hence I'd like to make sure they see the gene names that make sense to them.

My question, therefore, is: does anyone know a simple method to retrieve gene synonyms? I don't want to do any enrichment, clustering, or normalisation; I just need a mapping from HGNC symbol -> synonyms. I can't quite figure out how to persuade BioMart to do this.

In addition, are there certain sets of symbols that are preferred by some communities? For example, do I need to search through all the synonymous symbols, or can I just ask BioMart (or something similar) to return a particular set of gene symbols?

Andrew Su (San Diego, CA) wrote 8.9 years ago:

If you wanted to do this analysis for a large number of gene symbols and/or from the command line, I would first download gene_info.gz from the NCBI Entrez Gene FTP site, and then use awk to parse it. For example, SELL has the Entrez Gene ID 6402, so:

gzip -cd gene_info.gz | awk '$2==6402{print $5}'

produces this output:

CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1

(The second column of gene_info is Entrez Gene ID, the fifth column has the aliases)

You can also do a similar awk parsing based on the gene symbol directly, but then you probably also want to limit it by organism (e.g., human=9606). For example:

gzip -cd gene_info.gz | awk '$3=="SELL"&&$1==9606{print $5}'

produces the same output as above...

To get a file that translates all human gene symbols to their aliases:

gzip -cd gene_info.gz | awk '$1==9606{print $3"\t"$5}' > output.txt
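
For the reverse direction (a biologist gives you an alias such as CD62L and you want the official symbol), the pipe-separated alias column can be split into one alias per row. The following is only a minimal sketch and not part of the original answer. It assumes a two-column file shaped like output.txt; the sample file here is a toy stand-in (SELL's aliases are copied from above, and FAKE1 is a made-up gene whose alias column holds the "-" placeholder that gene_info uses for missing values). All file names are placeholders.

```shell
# Toy stand-in for output.txt (symbol <TAB> aliases); replace with the real file.
printf 'SELL\tCD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1\n' > sample_output.txt
printf 'FAKE1\t-\n' >> sample_output.txt   # hypothetical gene without aliases

# Split the alias column on "|" and print one "alias <TAB> symbol" pair per line,
# skipping genes whose alias column is the "-" placeholder.
awk -F '\t' '$2 != "-" { n = split($2, a, "|"); for (i = 1; i <= n; i++) print a[i] "\t" $1 }' \
    sample_output.txt > alias_to_symbol.txt

# Look up the official symbol for a given alias.
awk -F '\t' '$1 == "CD62L" { print $2 }' alias_to_symbol.txt
```

Note that an alias can belong to more than one gene, so a lookup on the real data may return several symbols.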

Four years after you posted this, I've just found it. Exactly what I was looking for, thanks. Like the original poster, I am only interested in H. sapiens genes. This means you don't need to download the rather large full list from Entrez, but can limit yourself to:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

written 4.9 years ago by Stu@IC

Very useful, thanks!

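One clarification: if you are searching by official symbol ($11), set awk's -F option to "\t". Earlier columns such as the description contain spaces, so the default whitespace splitting shifts the later column numbers. For example:

gzip -cd gene_info.gz | awk -F "\t" '$1==9606&&$3=="SELL"{print $3"\t"$5"\t"$11}'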

## $3 = Symbol
## $11 = Symbol_from_nomenclature_authority
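
To see why the tab separator matters, here is a small sketch with one toy record. It is not a real gene_info line: the values are placeholders laid out in the same column order, with a space inside column 9 (description). The file names are placeholders as well.

```shell
# One toy record with 11 tab-separated columns in gene_info order;
# column 9 (description) contains a space. Values are placeholders.
printf '9606\t6402\tSELL\t-\tCD62L|LAM1\t-\t1\t-\tselectin L\tprotein-coding\tSELL\n' > toy_gene_info.txt

# Default whitespace splitting: "selectin L" counts as two fields,
# so $11 lands on the type_of_gene value instead of the symbol.
awk '{ print $11 }' toy_gene_info.txt > default_split.txt

# Tab splitting keeps the description in one field, so $11 is the symbol.
awk -F '\t' '{ print $11 }' toy_gene_info.txt > tab_split.txt

cat default_split.txt tab_split.txt
```

The first line printed is the shifted value (protein-coding) and the second is the intended one (SELL). The columns before the description do not usually contain spaces, which is presumably why the earlier commands on $1, $2, $3 and $5 work without -F.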
written 4.5 years ago by LGMgeo

I have an R wrapper for this at https://github.com/oganm/geneSynonym. It extracts synonym information for the species of interest and allows you to query any gene symbol for synonyms.

written 4.4 years ago by oganm

Khader Shameer (Manhattan, NY) wrote 8.9 years ago:

GeneALaCart from GeneCards is a good place to start. There are definitely other resources that can do this type of mapping, but in my ID-mapping experience GeneCards provides a good number of aliases and descriptions for human genes.

A quick search using GeneALaCart returned the following aliases for CD62L:

Copied from the output CSV file :

Gene Symbol : SELL 
Entrez_Gene ID : 5579    
HGNC_ID : 9395
Aliases : LEU8 |LAM1 |LECAM1 |hLHRc |Leu-8 |TQ1 |LAM-1 |LSEL |PLNHR |LNHR |CD62L |gp90-MEL |LYAM1 |Lyam-1

It is very strange that the Entrez_Gene ID in your example is 5579, because that is the Entrez_Gene ID of PRKCB (protein kinase C). It should be 6402.

written 8.9 years ago by Fred Fleche

Fred Fleche (Paris, France) wrote 8.9 years ago:

In your case, maybe the easiest way would be to use the HGNC output data webpage.

You can easily check the fields of your choice like:

  • Approved Symbol
  • Aliases
  • Entrez Gene ID

Then also check

  • Select Status Approved
  • Select all Chromosomes

Then press submit to get the listing as a text file that you can either open in Excel or load into an SQL database.

In the case of the SELL gene reported previously, you get:

SELL#LSEL, LAM1, LAM-1, hLHRc, Leu-8, Lyam-1, PLNHR, CD62L#6402

Fred - IMHO stands for "in my honest opinion"! I think you have a fan, rather than a competitor...

written 8.9 years ago by Mike Dewar

I find French online acronyms very difficult too, though typically more fun! I went with the awk-based answer above as it involves less clicking, though I think your answer will be very helpful to others coming across this question...

written 8.9 years ago by Mike Dewar

IMHO, this has been my fault. In the future I'll try to be more precise and academic ;)

written 8.9 years ago by Jorge Amigo

IMHO this is, by far, the easiest way of retrieving such data

written 8.9 years ago by Jorge Amigo

@Jorge. You are very welcome to click on the "Add Another Answer" button and demonstrate how to get the listing through IMHO. I think everybody here is eager to learn new methods, so do not hesitate to share yours.

written 8.9 years ago by Fred Fleche

@Jorge. Actually, I didn't know this English acronym. I thought it was a bioinformatics server. And don't worry, I will never consider you or others as competitors. I am here to learn new solutions. I am glad you like my solution. Feel free to click the "Click to set this answer as your accepted answer" button ;-)

written 8.9 years ago by Fred Fleche

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 8.9 years ago:

Unfortunately, CD62L is not present in the UCSC DB. However, here is a query for another gene (PRKCB1) listing the position and the aliases.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg18 -e '
select distinct
 K.chrom,
 K.txStart,
 K.txEnd,
 A1.alias,
 A2.alias
from
 knownGene as K,
 kgAlias as A1,
 kgAlias as A2
where
 K.name=A1.kgID and
 K.name=A2.kgID and
 A1.alias<A2.alias and
 (A1.alias="PRKCB1" or A2.alias="PRKCB1") '

result:

+-------+----------+----------+------------+------------+
| chrom | txStart  | txEnd    | alias      | alias      |
+-------+----------+----------+------------+------------+
| chr16 | 23754822 | 24134810 | NM_002738  | PRKCB1     |
| chr16 | 23754822 | 24134810 | NP_002729  | PRKCB1     |
| chr16 | 23754822 | 24134810 | P05771-2   | PRKCB1     |
| chr16 | 23754822 | 24134810 | PKCB       | PRKCB1     |
| chr16 | 23754822 | 24134810 | PRKCB      | PRKCB1     |
| chr16 | 23754822 | 24134810 | PRKCB1     | uc002dmc.1 |
| chr16 | 23754822 | 24139063 | KPCB_HUMAN | PRKCB1     |
| chr16 | 23754822 | 24139063 | NM_212535  | PRKCB1     |
| chr16 | 23754822 | 24139063 | NP_997700  | PRKCB1     |
| chr16 | 23754822 | 24139063 | O43744     | PRKCB1     |
| chr16 | 23754822 | 24139063 | P05127     | PRKCB1     |
| chr16 | 23754822 | 24139063 | P05771     | PRKCB1     |
| chr16 | 23754822 | 24139063 | PKCB       | PRKCB1     |
| chr16 | 23754822 | 24139063 | PRKCB      | PRKCB1     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q15138     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q93060     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UE49     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UE50     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UEH8     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UJ30     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UJ33     |
| chr16 | 23754822 | 24139063 | PRKCB1     | uc002dmd.1 |
+-------+----------+----------+------------+------------+

Pierre: I am afraid what you have here is mostly ID mapping to UniProt (those starting with Q) and NCBI identifiers (NM). They may not qualify as synonyms of gene names.
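
One way to act on this concern is to filter the alias values after running the query. The following is a rough sketch, not something from the thread: the alias values are copied from the result table above, the file names are placeholders, and the regular expression is a heuristic assumption about what RefSeq accessions, 6-character UniProt accessions, UniProt entry names and UCSC known gene IDs look like. It is not a UCSC feature and may both miss identifiers and drop real symbols.

```shell
# Alias values copied from the query result above, one per line.
printf '%s\n' NM_002738 NP_002729 P05771-2 PKCB PRKCB PRKCB1 uc002dmc.1 \
    KPCB_HUMAN O43744 Q15138 > kg_aliases.txt

# Heuristic patterns (assumptions): RefSeq accessions (NM_/NP_/NR_),
# 6-character UniProt accessions with an optional isoform suffix,
# UniProt entry names ending in _HUMAN, and UCSC known gene IDs.
grep -E -v '^(N[MPR]_[0-9]+|[OPQ][0-9][A-Z0-9]{3}[0-9](-[0-9]+)?|[A-Z0-9]+_HUMAN|uc[0-9]{3}[a-z]{3}\.[0-9]+)$' \
    kg_aliases.txt > symbol_like.txt

cat symbol_like.txt
```

On this sample only PKCB, PRKCB and PRKCB1 survive the filter.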

written 8.9 years ago by Khader Shameer

presumably this would work if you used the official gene symbol SELL?

written 8.9 years ago by Andrew Su

@Andrew: Could you clarify this?

written 8.9 years ago by Khader Shameer

yes, it does work with SELL

written 8.9 years ago by Pierre Lindenbaum

I just checked PRKCB at GeneALaCart; it retrieves

PRKCB2 |MGC41878 |PRKCB1 |PKCB |PKC-B |PKC-beta |EC 2.7.11.13

as Aliases. Curious to know if we can get such gene synonyms via UCSC DB.

written 8.9 years ago by Khader Shameer

@Khader, ah, ok :-)

written 8.9 years ago by Pierre Lindenbaum