3.5 years ago by
Coming from a pure computational background myself, I shared your confusion at one point. I'm not sure if you have had any molecular biology training, so you may know some or all of this. Becoming familiar with the "central dogma of molecular biology" will help you make sense of the database and microarray landscapes. The basic dogma is "DNA >transcribed into> RNA >translated into> PROTEIN"; however, it's not actually that simple! A "gene" is a region of the genome (DNA, not RNA) that generally codes for one or *more* proteins, but the definition of a gene can change as stated above. Thus, a gene is a DNA sequence, and going from a gene to a transcript (the RNA that is actually measured by microarrays) is not necessarily a one-to-one mapping; it can also be one-to-many. The same applies to the gene-to-protein and transcript-to-protein mappings: they are not one-to-one. Many things can and do happen to the RNA *after* it is transcribed, such as alternative splicing, that can lead to the creation of different proteins.
Now we come to microarrays. Some microarrays only measure messenger RNA (mRNA), but there are other types of RNA too. With mRNA microarrays there is generally a direct mapping of the mRNA transcript to the protein, BUT the probes on the microarray only target a small fragment of each transcript (with Affymetrix it is 25 nucleotides). Thus, transcripts in the same gene family (or those generated from the same gene, i.e. the same DNA region) may have high sequence similarity in the region the probe targets, so one probe can actually "cross-hybridize" with multiple mRNA transcripts! The array manufacturers generally try to target unique portions, but this is not always possible. I'm not familiar with Illumina arrays, but with Affymetrix the suffix of the probe set ID is a hint: as far as I know, a plain "_at" ID is designed to match a single transcript, "_a_at" matches alternative transcripts of the same gene, "_s_at" matches transcripts from more than one gene, and "_x_at" can cross-hybridize with other, unrelated sequences. Hence, when a probe cross-hybridizes, the probe ID will map to multiple mRNA transcripts (possibly from different genes), which will then map to multiple proteins.
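To make the cross-hybridization point concrete, here is a minimal sketch that checks which transcripts contain the 25-nucleotide region a probe is designed against. The sequences and transcript names are made up for illustration, and exact substring matching is a simplifying assumption: real cross-hybridization also happens with near-identical sequences, so this understates the problem.

```python
# Toy illustration: which transcripts contain a probe's 25-nt target region?
# All sequences and names below are hypothetical, not real genes.

PROBE_LEN = 25  # Affymetrix probes are 25 nucleotides long

# A region shared by two transcripts, and a region found in only one of them.
shared_region = "GCTGAAAGGGTGCCCGATAGTTTAC"
unique_region = "ATGGCCATTGTAATGGGCCGCATTC"

transcripts = {
    "geneA_tx1": unique_region + shared_region,
    "geneA_tx2": "TTGACCA" + shared_region + "GGA",
    "geneB_tx1": "CCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGG",
}


def target_hits(target, transcripts):
    """Return the sorted names of transcripts containing the target exactly."""
    return sorted(name for name, seq in transcripts.items() if target in seq)


shared_hits = target_hits(shared_region, transcripts)
unique_hits = target_hits(unique_region, transcripts)

print("shared region hits:", shared_hits)
print("unique region hits:", unique_hits)
```

A probe designed against the shared region cannot tell the two geneA transcripts apart, while a probe against the unique region reports on one transcript only.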
To simplify things a little, I would suggest figuring out which probe IDs uniquely target mRNA transcripts and starting with them. Then separate out the multi-mapped probe IDs and work through those. In the past, when I needed to map probe IDs to proteins or genes, I would just duplicate the expression value of the probe for however many proteins it was mapped to and use those values in any network analysis I did. This is NOT ideal, however. When mapping probe IDs to genes or proteins, just think of it as a many-to-many problem instead of a one-to-one.
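The workflow above can be sketched roughly as follows. This is only an illustration under assumptions: the probe IDs, protein IDs, and expression values are invented, and the annotation is assumed to be already available as a plain dictionary from probe ID to the list of proteins it maps to (in practice it would come from the array's annotation files).

```python
from collections import defaultdict

# Hypothetical annotation: probe ID -> protein IDs it maps to.
probe_to_proteins = {
    "probe_1": ["P1"],
    "probe_2": ["P2", "P3"],  # multi-mapped probe
    "probe_3": ["P3"],
}

# Hypothetical expression values, one per probe.
expression = {"probe_1": 7.2, "probe_2": 5.1, "probe_3": 9.4}


def split_probes(mapping):
    """Separate probes with exactly one target from multi-mapped probes."""
    unique = {p: t[0] for p, t in mapping.items() if len(t) == 1}
    multi = {p: t for p, t in mapping.items() if len(t) > 1}
    return unique, multi


def expand_to_proteins(expression, mapping):
    """Duplicate each probe's value for every protein it maps to.

    The result is many-to-many: one protein may collect values
    from several probes, and one probe may feed several proteins.
    """
    values = defaultdict(list)
    for probe, value in expression.items():
        for protein in mapping.get(probe, []):
            values[protein].append((probe, value))
    return dict(values)


unique, multi = split_probes(probe_to_proteins)
by_protein = expand_to_proteins(expression, probe_to_proteins)

print("unique probes:", unique)
print("multi-mapped probes:", multi)
print("values per protein:", by_protein)
```

Keeping the source probe ID next to each duplicated value makes it possible to see later which protein-level values came from a multi-mapped probe, which matters because the duplication itself is a crude shortcut.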