Gene/Protein Name Etymology
7.5 years ago
axelwilhelm ▴ 110

How can I find where a protein or gene name comes from?

I have been googling for neuropeptide Y. Why Y? Y Y? Y why? Where does the Y come from, and how can I find out in the future?

gene protein • 4.2k views
7.5 years ago
Superbest ▴ 130

I also tend to wonder about gene names, but in my experience, finding out the reason is often not that helpful, unless you enjoy useless trivia (no offense meant - personally I like trivia and it sometimes helps me remember non-trivial facts).

The patterns that are consistent enough to be worth paying attention to happen to also be the most boring ones:

  • Some genes are just given a random symbol, which happened to be a bunch of letters not already taken by other gene names, because the researcher didn't particularly care what the gene was called.

  • Newly discovered genes are often named after their crudest features: Zeb1, a developmental protein, is "Zinc finger E-box Binding domain".

  • Genes with simple functions are named accordingly: enzymes which catalyze just one reaction are named after it (this rule is so famous that it's taught in middle school) - e.g. lactase breaks down lactose.

  • Like enzymes, receptors are often named "receptor of [ligand that it recognizes]".

  • If there are multiple copies of a gene (paralogs or splice variants) they go 1, 2, 3 or a, b, c or alpha, beta, gamma... Technically paralogs should be given numbers and splice variants get English letters, but it's not like there's a gene name police who will come after you for breaking the rule. Organizations like HUGO sometimes try to make people be less messy with their nomenclature (so that the bioinformaticians don't have to tear out their hair as much when writing literature parsers), but every field has a few senior scientists who find the good old "cute" names endearing and don't want to break tradition (well, renaming genes also generates confusion because now the same gene is referred to by different names in different papers). Also, sometimes people will find that after numbering a bunch of genes, not all of them are actual genes - this is why for example human myosins go Myh1 to Myh17, but there are only 14 total (some are retracted genes - Myh5 turned out to be the same as Myh16, and then it turned out that it wasn't a gene at all) and there is a Myh7b that is a different locus entirely from Myh7.

  • Many genes are named "Homologous to [some well-known yeast, human or E. coli protein]". This is because when you first sequence the genome of a new species, the annotation software finds a bunch of ORFs which are probably bona fide genes, but nobody knows what they do. They are named for whatever well-studied gene they are most similar to, since that's a good guess to start from. Obviously, very often they have a completely different function - but sometimes the name sticks anyway and becomes yet another stumbling block for biology students who aren't in on the joke.

  • Relatedly, protein families and domains are often given names which are the initials of well known members of that family. For example, the PDZ domain is "Post synaptic density protein, Drosophila disc large tumor suppressor, Zonula occludens-1 protein".

  • Some genes are named after the tissue or animal in which they were first discovered. This becomes amusing when it turns out the gene's function has little to do with that tissue: Cadherins are proteins that glue cells together. Cdh1 glues them very tightly, making skin - it's called epithelial cadherin. Cdh2 is much less tight; it is called neural cadherin, but it's found in other places as well - like fibroblasts.
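
If you ever end up writing one of those literature parsers, the number/letter suffix pattern from the paralog bullet above can be roughly sketched in Python. This is only a minimal sketch, assuming symbols look like "root + number + optional lowercase letter"; the regex, the function names, and the "Abc" family in the usage note are all made up for illustration, and plenty of real symbols will not fit the pattern.

```python
import re
from collections import defaultdict

# Assumed pattern (not an official rule): alphabetic root, a number,
# then an optional single lowercase letter, e.g. "Myh7b".
SYMBOL_RE = re.compile(r"^([A-Za-z]+?)(\d+)([a-z]?)$")


def split_symbol(symbol):
    """Split a symbol into (root, number, letter); return None if it does not fit."""
    match = SYMBOL_RE.match(symbol)
    if match is None:
        return None
    root, number, letter = match.groups()
    return root, int(number), letter


def numbering_gaps(symbols):
    """Map each root to the numbers missing between 1 and its highest number."""
    seen = defaultdict(set)
    for symbol in symbols:
        parts = split_symbol(symbol)
        if parts is not None:
            root, number, _letter = parts
            seen[root].add(number)
    return {
        root: sorted(set(range(1, max(numbers) + 1)) - numbers)
        for root, numbers in seen.items()
    }
```

For example, `split_symbol("Myh7b")` gives `("Myh", 7, "b")`, and `numbering_gaps(["Abc1", "Abc2", "Abc4", "Abc4b"])` reports that the hypothetical "Abc" family skips number 3. Names like "R2D2" or "14-3-3" simply return `None`, which is about what you would expect from the exceptions.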

I think the most interesting stories come from patterns which are rare enough to be exceptions:

  • p21 and p53 are called that because in the deep dark ancient days when nobody knew anything, somebody threw some cell extract on a gel and did a Western. A bunch of bands came out - it was assumed that each band was a protein somehow related to the cell cycle. Since the only thing known about them at that point was mass, they were named after mass. p21 for instance runs at 21 kDa. These proteins quickly generated so many publications referencing their provisional names that, by the time everyone realized how important they were, getting the community to switch to new names was too much work - and it's not very easy to come up with a name that would do justice to these very versatile proteins. p53 was renamed to TP53 for "tumor protein p53" - but I find it humorous that "a protein that runs at 53 kDa and happens to be one of the hundreds of proteins implicated in tumors" is somehow a much better name than "a protein that runs at 53 kDa". Amusingly, p53 isn't even 53 kDa in size - its running at 53 kDa is an artifact.

  • Some proteins are even more involved diaries of their discovery: 14-3-3 proteins were found in the 14th fraction at position 3.3 on an electrophoretic gel. Similar to some cell lines: 4T1 cells (which annoyingly sound like "41 cells" at talks), for instance, originally came from a 10^4 cell injection at day 1 - hence 4 and T1.

  • Some gene names are jokes. There is a protein which is associated with Dicer and involved in processing of miRNA. Not missing the opportunity for a clever pun, the researchers named it "Related To Dcr2" - R2D2. There's apparently an overabundance of such names.

  • Some genes are clever puns on the more "serious" naming systems above - SNAI1 is a perfectly serious name, as I recall it refers to structural elements of the protein. Someone was apparently much amused at noticing that the 1 looks like an L, so the gene is now referred to as "Snail" (it has nothing to do with snails, it makes cells move around the notochord to form the spinal column in embryos). Naturally, SNAI2 is called "Slug", because a slug is just like a snail, see, except without a shell (I think Snai1 has a GSK3-beta degradation tag while Snai2 lacks it). I don't know what a Smuc is, but that's what they call SNAI3.

  • Genes which are characterized through null mutations, especially yeast genes, are named after the phenotype of the mutant. CAN1 is a pump which allows the antibiotic canavanine to kill the yeast - it's called "canavanine resistance" because null mutant strains (named "can1 strains", lowercase for mutated gene) are resistant. It's a bit confusing because having this canavanine resistance gene actually makes cells not resistant to canavanine. This even happens with humans - for instance XPB is named after the weird, obscure "vampirism disease" Xeroderma pigmentosum. There is nothing obscure about the protein, it is one of the fundamental parts of the eukaryotic transcription machinery.

  • It is a tradition in developmental biology, especially for work on Drosophila embryos (which is where much of developmental work is done), to give genes cute nicknames after what the embryo looks like when the gene is non-functional (most developmental mutations, especially in the important genes, are so severe that the embryo rarely survives to adulthood). So you get genes like Giant (because the embryos are slightly larger in Giant knockouts), Krüppel (German for "cripple"; for some reason many of these names are also German), Hunchback, Tailless... The famous hedgehog genes make the embryo look spiny - presumably someone at the lab was an avid gamer, and one gene references the famous blue, swift rodent. The others get names like desert hedgehog, Indian hedgehog, African hedgehog, because the genes are phylogenetically related just like the hedgehogs, you see (then there's also the tiggywinkle hedgehog). This one happened to cause a stir when people with the very serious diseases caused by Sonic Hedgehog mutations resented the jocular attitude of geneticists studying the source of their ailment.

  • Fly geneticists tend to come up with some fanciful names in general. Christiane Nüsslein-Volhard, upon seeing embryos with mutations in what is now known as the Toll gene, exclaimed "Das ist ja toll!" ("That's amazing!") - hence the name. Another example is the unit of the Drosophila compound eye - each unit is made up of 7 cells, and when the gene Sev (sevenless) is deleted, the seventh cell does not develop. Downstream/upstream interactors of this gene became "son of sevenless" (SOS - immediately downstream of Sev, and can also rescue a sevenless mutant) and "bride of sevenless" (the ligand of Sev).

  • Some gene names are holdovers from old misconceptions that never quite went away. DCC ("Deleted in Colorectal Cancer") got its name because its deletion was once suspected to be a pivotal event leading to colorectal cancer. It turned out not to be that pivotal for colorectal cancer, but it is very pivotal in the pivoting of nerve cell axons. (That's not to say the involvement in cancer is a misconception, but it's certainly not as important as was thought when the name was assigned.)

Stories may not tell you much about the role of a gene - though they make for interesting dinner conversation (if you don't mind finding that very soon into the conversation, you are the only participant remaining).

As often as not, these in-jokes and accidents of history are not really well-recorded, and very difficult to find with algorithms like PageRank which specifically try to skip trivial information and show you the salient stuff. Probably the best way of learning this "gene nomenclature lore", in my experience, is getting drunk with other scientists at conferences.

