Why do many genes have several synonyms? Is it because they were found by different people at different times? How were these genes named? Are there rules for naming? Does a gene name carry biological meaning? What kind of information can I get from a gene name? What does it indicate if two genes have similar names, e.g., XXXX1 and XXXX2?
It's usually because of the way they were discovered, which can lead to all sorts of naming conventions. For example, a gene discovered in Drosophila will usually be named after its phenotype, e.g. weird wing. The same gene will cause a completely different phenotype in yeast (they don't have wings!) so it will be given a totally different name, e.g. weird mitosis. The fly and yeast people will then go off and find the human orthologue and give it a related name, e.g. WW1 and WM1. Meanwhile, someone else will be doing some reverse genetics in human and calling it something like locus1, or naming it after the contig on the genome where it's found, e.g. RP11-I-found-it-here. Somebody else will find that it has some similarity to their favourite gene (TFG) and so call it TFG-like or TFGL. Another person will discover it from studying a biological process, and so call it biological process 1 (BP1).
Then HGNC will come along, whose mission is to provide unique and meaningful names for genes, and they will give it a definitive name, probably the name that tells you the most about its function, which in this case is BP1.
But that's not the end! The genome annotators then discover that it's not actually one gene, but two duplicated genes, so they each become BP1A and BP1B. All of those other names exist in the literature so both genes get all the synonyms WW1, WM1, Locus1, RP11-I-found-it-here, TFGL and BP1.
Yes, many (most) genes have several synonyms, in part because they were 'found' by different people at different times in different contexts. One of the best resources that aggregates the synonyms of genes is Entrez Gene. Sometimes more than one community of researchers, each with a different focus, is working on a gene, and each has chosen a name that describes the gene as it relates to that area of focus. For example, many genes were discovered by disease genetics studies. These genes tend to be named for the disease that they are related to. In any case, once this happens and alternate names are used in publications, we need to keep track of the synonyms so that researchers can easily find all relevant information. To help mitigate the potential confusion caused by gene synonyms, many journals now encourage use of 'official' gene names (more on that below).
Yes, the name often shows biological meaning. When very little is known about a gene it is often named according to its physical locus in the genome, or a predicted open reading frame (ORF) that it contains. When nothing is known about its function, it may be named based on its homology to another gene that is well characterized. As mentioned above it may be named to reflect relevance to certain diseases. As the function is better characterized and the gene is placed in broad functional categories (e.g. a receptor tyrosine kinase) it may be named to reflect this function. Certain functional categories such as kinases and ion channels have community established conventions for more specific naming.
Yes, there are rules, though these are constantly evolving. For human, perhaps the most notable effort to formalize the naming of genes is the HUGO Gene Nomenclature Committee (HGNC), a committee of the Human Genome Organisation (HUGO). The HGNC provides a very detailed series of guidelines for human gene nomenclature.
There are many organizations dedicated to annotating the genomes of various species. These are the source of many commonly used gene identifiers. For human, there are efforts such as Entrez, UCSC, Ensembl, OMIM, Vega, CCDS, MGC, RefSeq, etc. These efforts attempt to resolve ambiguity in defining and naming the operational units of the genome (genes). Inevitably, they also introduce some ambiguity of their own.
There are multiple explanations for why a gene might have two similar names such as XXXX1, XXXX2. One possibility is that two adjacent regions in the genome are at one point considered distinct. But at a later point one of the organizations above determines that they should be merged. For example, there might be two transcript isoforms that were considered distinct genes at one point but are now considered alternative isoforms of the same locus. The same can happen in reverse, where a single locus is divided into two distinct genes. For this reason and others, there are many cases where the same synonym is used for several genes. You can get a sense of this by downloading Entrez's complete list of human gene names and synonyms (Homo_sapiens.gene_info.gz).
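One rough way to see how often a synonym points at more than one gene is to index the Synonyms column of that file. The sketch below assumes the tab-separated gene_info layout (Symbol in the third column, Synonyms in the fifth, pipe-separated, with "-" for an empty value). The function name and the sample rows are made up for illustration; the rows are placeholders, not real records.

```python
from collections import defaultdict

def shared_synonyms(lines):
    """Map each synonym to the official symbols that list it,
    keeping only synonyms claimed by more than one gene."""
    by_synonym = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header line and blank lines
        fields = line.rstrip("\n").split("\t")
        symbol, synonyms = fields[2], fields[4]
        if synonyms == "-":
            continue  # "-" marks an empty Synonyms column
        for syn in synonyms.split("|"):
            by_synonym[syn].add(symbol)
    return {s: sorted(g) for s, g in by_synonym.items() if len(g) > 1}

# Toy rows in the assumed layout (placeholder IDs and names, not real genes).
sample = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n",
    "9606\t1\tBP1A\t-\tBP1|TFGL|WW1\n",
    "9606\t2\tBP1B\t-\tBP1|TFGL|WM1\n",
    "9606\t3\tOTHER\t-\t-\n",
]
print(shared_synonyms(sample))

# For the real file (the local path is an assumption):
# import gzip
# with gzip.open("Homo_sapiens.gene_info.gz", "rt") as fh:
#     ambiguous = shared_synonyms(fh)
```

On the toy rows this reports BP1 and TFGL as shared between BP1A and BP1B, which is the kind of ambiguity a split locus leaves behind.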
To illustrate all of this, let's consider an example: FLT4. Right now the HGNC-provided name for this gene is FLT4. In the future this could change, but for the most part these names are stable. FLT4 has at least four synonyms: PCL; FLT41; LMPH1A; VEGFR3. The official full name is 'fms-related tyrosine kinase 4'. In this case, the gene is being named according to its function as a kinase (an enzyme that transfers phosphate groups from a donor molecule to a substrate). It has the synonym 'VEGFR3' because it acts as a tyrosine kinase receptor for vascular endothelial growth factors C and D. So VEGFR3 describes the gene in a signalling context, another broad way to think about the function of certain genes. It is also called 'LMPH1A' because mutations in this gene cause hereditary lymphedema type IA. FLT41 refers to a long isoform expressed from the same locus.
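In practice, keeping track of synonyms usually comes down to a lookup from any known name to the official symbol. Here is a minimal sketch using only the FLT4 synonyms listed above. The table layout and function names are hypothetical, and matching names case-insensitively is an assumption that may not be safe across species.

```python
# Synonyms as listed in the FLT4 example; the table layout is illustrative.
ALIASES = {
    "FLT4": ["PCL", "FLT41", "LMPH1A", "VEGFR3"],
}

def build_lookup(aliases):
    """Map every name (official or synonym), upper-cased, to the official symbol."""
    lookup = {}
    for official, synonyms in aliases.items():
        for name in [official, *synonyms]:
            lookup[name.upper()] = official
    return lookup

def to_official(name, lookup):
    """Resolve a name to its official symbol, or None if it is unknown."""
    return lookup.get(name.strip().upper())

lookup = build_lookup(ALIASES)
print(to_official("VEGFR3", lookup))  # FLT4
print(to_official("lmph1a", lookup))  # FLT4
print(to_official("WW1", lookup))     # None
```

A plain dict like this silently keeps only one gene per synonym, so a real table would need to store a list of candidates for names that are shared by several genes.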
To get a good feel for how gene naming evolves, repeat this exercise for several genes, examining the Entrez Gene record, HGNC entry, and so on for each. For example, consider: ERBB2 (HER2), KMT2A (MLL), and DMD.
There are two major sources for gene names:
- Internal database IDs that reflect the identity of a gene within a repository: GenBank IDs (unique per entry), accession numbers (these may have versions and refer to updated data for a given entity) and locus numbers (these refer to a location on a genome). Each database may have its own rules: UCSC IDs, Ensembl IDs, etc.
- Biologically meaningful gene nomenclature that conforms to the rules established for a given organism. These rules may be different for different organisms. You can read more details here: http://en.wikipedia.org/wiki/Gene_nomenclature (see the list for different organisms).
For example, the rules for naming human genes can be seen here: http://www.genenames.org/guidelines.html
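As a small illustration of the first kind of identifier: versioned accessions are commonly written as accession.version, so the stable part and the version can be separated when comparing records. The helper name and the example identifiers below are placeholders in the usual shape, not real records, and each database may deviate from this convention.

```python
def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).
    The version is None when there is no numeric suffix after a dot."""
    base, dot, suffix = identifier.rpartition(".")
    if dot and base and suffix.isdigit():
        return base, int(suffix)
    return identifier, None

# Placeholder identifiers, shaped like versioned and unversioned accessions.
print(split_version("NM_000000.3"))      # ('NM_000000', 3)
print(split_version("ENSG00000000000"))  # ('ENSG00000000000', None)
```

Comparing on the accession alone treats two versions of a record as the same entity, which is usually what you want when joining gene lists from different releases.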
Science works by domain. Even if a name is not so great in the big scheme of things, people in a given group stick to their group's standards. Authors in the life sciences use what may seem like strange names in their articles because, in the context of the article, the given name is obvious to all (human) readers. Names are created by authors and usually make sense in the context of a publication. Different people can name things differently.
The curators of the main databases that collect these gene names (RefSeq, HGNC, OMIM, UniProt for human, MGI for mice, FlyBase for flies, etc.) then add all the synonyms that they find in the papers they read, which means that there are many synonyms that are not unique to a gene and seem weird when you look at the list.