Question

Gene Identifiers and Databases

8.8 years ago

Sirio ▴ 30

Apologies if this is a naïve question; my background is mainly statistics/machine learning, and I have only recently started working in this field.

I tried to get my head around the different databases that are used to describe genes/transcripts, but this figure (http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php) confused me more than it helped. I would like pointers to any tutorial/paper that answers these questions for me:

Why are there multiple databases to describe the same gene?
Which database is used where?
Is there an overarching database that is regularly maintained?
Is there some form of naming convention (similar to what we tend to use in programming) for genes?
(in R if possible) how to simply translate a microarray Illumina Probe ID, e.g. 130070, to an annotation

I naïvely thought that an Illumina Probe ID, e.g. 130070, would translate to "protein X involved in pathways Y and Z and potentially in W too".

Thanks,
Sirio

sequencing geneID gene database • 2.0k views

Updated 18 months ago by Ram • written 8.8 years ago by Sirio

Ram · Answer 1 · 2015-07-29

There are multiple databases primarily for historical reasons: at different points in time, different groups or institutions wanted to focus on different things. If no existing database quite fit their needs, they built one that did. These days almost all of the major databases link out to each other or automatically incorporate each other's data.

Different databases do have slightly different flavours. If I'm more interested in the protein (versus the gene), and particularly its structure and function, I'll go to UniProt. If I'm interested in transcripts and their annotations I'll probably use Ensembl. If I'm looking for the function of the gene in general I'll probably just toss the gene name into Google and hit GeneCards, NCBI, and the results from a few other databases. A lot of the same information is replicated between the databases, but the way it is laid out may make it easier to find certain things I am looking for. Some also have information that the others don't.

Yes, there are naming conventions for genes, but they are typically regulated at the organism level. There are a few historical "flavours" of gene naming (yeast versus Drosophila versus E. coli versus human, for example) from various model organisms, so there can be some hiccups when the homolog of a gene is named slightly differently in one organism than in another.
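Each of these databases also uses its own identifier scheme, which is one practical reason the landscape looks confusing at first. As a rough illustration, here is a minimal Python sketch that guesses which resource an identifier comes from based on its format. The function name `guess_id_type` and the pattern table are my own assumptions for illustration; the patterns cover only the common human-style formats and are not exhaustive.

```python
import re

# Patterns for a few common identifier schemes. Illustrative only:
# they cover the usual human-style formats, not every case.
ID_PATTERNS = [
    ("Ensembl gene", re.compile(r"ENSG\d{11}(\.\d+)?")),
    ("Ensembl transcript", re.compile(r"ENST\d{11}(\.\d+)?")),
    ("Ensembl protein", re.compile(r"ENSP\d{11}(\.\d+)?")),
    ("RefSeq mRNA", re.compile(r"[NX]M_\d+(\.\d+)?")),
    ("RefSeq protein", re.compile(r"[NX]P_\d+(\.\d+)?")),
    ("HGNC ID", re.compile(r"HGNC:\d+")),
    ("Illumina probe", re.compile(r"ILMN_\d+")),
    ("UniProt accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
]


def guess_id_type(identifier: str) -> str:
    """Guess the source of an identifier from its format alone."""
    for name, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    if identifier.isdigit():
        # Bare numbers are ambiguous: NCBI Gene IDs are numeric, and so
        # are some platform-specific probe numbers.
        return "numeric ID (ambiguous)"
    return "unknown (possibly a gene symbol)"


# Example identifiers, used only to exercise the patterns.
for example in ["ENSG00000139618", "NM_000059.4", "P51587", "130070", "BRCA2"]:
    print(example, "->", guess_id_type(example))
```

Note that a bare number such as 130070 cannot be resolved from its format alone; you need to know which platform or database it came from before you can look it up.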

Ram · Answer 2 · 2015-07-29

Coming from a purely computational background myself, I shared your confusion at one point. I'm not sure if you have had any molecular biology training, so you may know some or all of this. Becoming familiar with the "central dogma of molecular biology" will help you make sense of the database and microarray landscapes. The basic dogma is "DNA >transcribed into> RNA >translated into> PROTEIN"; however, it's not actually that simple! A "gene" is a region of the genome (not RNA) that generally codes for one or more proteins, but the exact definition of a gene varies, as another answer notes. Thus, a gene is a DNA sequence, and going from a gene to a transcript (the RNA that is actually measured by microarrays) is not necessarily a one-to-one mapping; it can also be one-to-many. The same applies to the gene-to-protein and transcript-to-protein mappings: they are not one-to-one. Many things can and do happen to the RNA after it is transcribed, such as alternative splicing, that can lead to the creation of different proteins.

Now we come to microarrays. Some microarrays only measure messenger RNA (mRNA), but there are other types of RNA too. With mRNA microarrays there is generally a direct mapping of the mRNA transcript to the protein, BUT the probes on the microarray only target a small fragment of each transcript (with Affymetrix it is 25 nucleotides). Thus, transcripts in the same gene family, or those generated from the same gene (the DNA region), may have high sequence similarity in the region the probe targets, so one probe can actually "cross-hybridize" with multiple mRNA transcripts! The array manufacturers generally try to target unique portions, but this is not always possible. I'm not familiar with Illumina arrays, but with Affymetrix the suffix of the probe set ID indicates its specificity: as far as I know, a plain "_at" ID is designed to match a single transcript, "_a_at" matches alternative transcripts of the same gene, and "_s_at" or "_x_at" can match transcripts from different genes. Hence, when a probe cross-hybridizes, the probe ID will map to multiple mRNA transcripts (possibly from different genes), which will then map to multiple proteins.
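To make the cross-hybridization point concrete, here is a small Python sketch with one hypothetical gene that is spliced into two isoforms. All sequences and names are made up, and an exact substring match stands in for hybridization (real hybridization involves complementary strands and tolerates some mismatches), so treat this as a toy model only.

```python
# Toy exon sequences for one hypothetical gene (made-up sequences).
EXONS = {
    "exon1": "ATGGCTAGCTAGGATCCGATCGATCGGCTA",
    "exon2": "TTGACCGGTAACGTTAGCCATGCAAGTCCA",
    "exon3": "GGCATTCGAACGTTGCAATGCCGTAGCTAA",
}

# Two isoforms produced by alternative splicing: isoform_B skips exon2.
ISOFORMS = {
    "isoform_A": ["exon1", "exon2", "exon3"],
    "isoform_B": ["exon1", "exon3"],
}

# One gene (DNA) gives rise to several transcripts (RNA).
TRANSCRIPTS = {
    name: "".join(EXONS[exon] for exon in exon_names)
    for name, exon_names in ISOFORMS.items()
}


def targets(probe_target):
    """Return the transcripts that contain the probe's target sequence."""
    return sorted(
        name for name, seq in TRANSCRIPTS.items() if probe_target in seq
    )


PROBE_LENGTH = 25  # the Affymetrix probe length mentioned above

# A probe placed in exon1, which both isoforms share.
shared_probe = EXONS["exon1"][:PROBE_LENGTH]
# A probe placed in exon2, which only isoform_A contains.
specific_probe = EXONS["exon2"][:PROBE_LENGTH]

print(targets(shared_probe))    # ['isoform_A', 'isoform_B']
print(targets(specific_probe))  # ['isoform_A']
```

The probe in the shared exon cannot tell the two isoforms apart, while the probe in the skipped exon is specific to one of them. The same logic applies to similar transcripts from different genes.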

To simplify things a little, I would suggest figuring out which probe IDs uniquely target mRNA transcripts and starting with them. Then separate out the multi-mapped probe IDs and work through those. In the past, when I needed to map probe IDs to proteins or genes, I would just duplicate the expression value of the probe for however many proteins it was mapped to and use those values in any network analysis I did. This is NOT ideal, however. When mapping probe IDs to genes or proteins, just think of it as a many-to-many problem instead of a one-to-one problem.
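The workflow described above can be sketched in a few lines of Python. The probe IDs, gene names, and expression values below are all hypothetical, and the helper names `split_probes` and `expand_to_genes` are my own; this is a minimal sketch of the bookkeeping, not a recommended analysis method.

```python
from collections import defaultdict

# Hypothetical probe-to-gene annotation (all IDs are made up).
PROBE_TO_GENES = {
    "probe_1": ["GENE_A"],
    "probe_2": ["GENE_B", "GENE_C"],  # multi-mapped probe
    "probe_3": ["GENE_A"],            # a second probe for GENE_A
    "probe_4": [],                    # probe without any annotation
}

# Hypothetical expression values for a single sample.
EXPRESSION = {"probe_1": 7.2, "probe_2": 5.1, "probe_3": 6.8, "probe_4": 3.3}


def split_probes(probe_to_genes):
    """Separate uniquely mapped probes from multi-mapped ones."""
    unique = {p: g[0] for p, g in probe_to_genes.items() if len(g) == 1}
    multi = {p: g for p, g in probe_to_genes.items() if len(g) > 1}
    return unique, multi


def expand_to_genes(probe_to_genes, expression):
    """Duplicate each probe's value for every gene the probe maps to."""
    gene_values = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        for gene in genes:
            gene_values[gene].append(expression[probe])
    return dict(gene_values)


unique, multi = split_probes(PROBE_TO_GENES)
gene_values = expand_to_genes(PROBE_TO_GENES, EXPRESSION)

print(unique)       # {'probe_1': 'GENE_A', 'probe_3': 'GENE_A'}
print(multi)        # {'probe_2': ['GENE_B', 'GENE_C']}
print(gene_values)  # {'GENE_A': [7.2, 6.8], 'GENE_B': [5.1], 'GENE_C': [5.1]}
```

The output shows both directions of the many-to-many relationship: GENE_A is measured by two probes, while the single value from probe_2 is copied to both GENE_B and GENE_C. How to summarize several probes per gene (mean, maximum, most variable probe) is a separate choice that this sketch leaves open.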

Ram · Answer 3 · 2015-07-29

Why are there multiple databases to describe the same gene?

Because there are different genomes and different ways of annotating them, i.e. there are different definitions of what a gene is.

Which database is used where?

Not sure what you mean here. Typically one would use a database that has data relevant to the problem at hand.

Is there an overarching database that is regularly maintained?

There are several good resources out there; which one to use depends on the data type. For genomics-type information, I would recommend Ensembl.

Is there some form of naming convention (similar to what we tend to use in programming) for genes?

Yes, for most model organisms. For human genes, there's the HGNC (HUGO Gene Nomenclature Committee).

(in R if possible) how to simply translate a microarray Illumina Probe ID, e.g. 130070, to an annotation

Getting from probes to genes or proteins involves mapping the sequences of the probes to some reference genome or transcriptome and using the corresponding annotations to infer the targeted genes/proteins.
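As a rough sketch of the second half of that process, suppose the probe sequences have already been aligned to a transcriptome, and the alignment hits and the transcript annotation are available as lookup tables. Everything below (probe IDs, transcript IDs, gene names, descriptions, and the function `annotate_probe`) is hypothetical. The sketch is in Python; in R the same lookup is usually done with platform-specific annotation packages from Bioconductor rather than by hand.

```python
# Hypothetical result of aligning probe sequences to a transcriptome:
# each probe is listed with the transcripts it matched.
PROBE_HITS = {
    "probe_1": ["tx_A1", "tx_A2"],  # two isoforms of the same gene
    "probe_2": ["tx_B1", "tx_C1"],  # transcripts from two different genes
    "probe_3": [],                  # no match in this reference
}

# Hypothetical transcript annotation: transcript -> (gene, description).
TRANSCRIPT_ANNOTATION = {
    "tx_A1": ("GENE_A", "hypothetical kinase"),
    "tx_A2": ("GENE_A", "hypothetical kinase"),
    "tx_B1": ("GENE_B", "hypothetical transporter"),
    "tx_C1": ("GENE_C", "hypothetical transporter-like protein"),
}


def annotate_probe(probe_id):
    """Infer the gene(s) a probe targets from its transcript hits."""
    genes = {}
    for transcript in PROBE_HITS.get(probe_id, []):
        gene, description = TRANSCRIPT_ANNOTATION[transcript]
        genes[gene] = description  # isoforms of one gene collapse here
    if not genes:
        status = "unmapped"
    elif len(genes) == 1:
        status = "gene-specific"
    else:
        status = "ambiguous"
    return {"probe": probe_id, "status": status, "genes": genes}


for probe_id in PROBE_HITS:
    print(annotate_probe(probe_id))
```

A probe that hits several isoforms of one gene is still unambiguous at the gene level, which is why the status is decided after collapsing transcripts to genes. Pathway or function annotations would then be attached to the gene, not to the probe directly.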