Question: Handling Duplicate Probe Expression Values In Spotted cDNA Microarray
Sudeep wrote, 8.4 years ago:

Dear All,

I am working on a publicly available microarray dataset from GEO, and I am interested in finding correlated genes in it. The dataset comes from a custom spotted cDNA microarray. When I calculated Pearson correlations on the M values obtained after normalization (using limma), I saw that duplicate probes appear as highly correlated (correlation values ranging from 0.60 to 0.99). My question is: how should I handle the expression values of these duplicated spots? Should I take the mean or the highest value of these probes? I have been searching on this for some time, but I couldn't find anything.

Thank you in advance.

correlation microarray • 6.5k views
Obi Griffith (Washington University, St Louis, USA) wrote, 8.4 years ago:

My recommendation, based on experience with oligo arrays (e.g., Affymetrix expression arrays), is not to combine probe sets with a simple average, median, etc. when more than one probe set supposedly queries the same gene. Take the example of ESR1 (estrogen receptor), a very important gene in breast cancer. On the U133A array it is represented by 9 different probe sets, only one of which works as expected (see figure below). Averaging produces a terrible result. Even the cleverly redefined custom probe sets from the Michigan group don't perform well in this case (although generally they work much better than Affy's standard probe set definitions).

What you should do probably depends on your final goal. But if that goal involves identifying differentially expressed genes between conditions, or using expression values in a clustering or classification exercise, then I suggest:

  1. Choose the probe set/spot with the highest variance (across all samples in your study) for each gene. This is the kind of filtering you are likely to do anyway to reduce the multiple-testing problem, it is unbiased with respect to your comparison, and it avoids averaging out real signal with noise.
  2. An even safer option (in some ways) is to just leave all probe sets/spots in your analysis until the very final stage of biological interpretation. This way each probe set corresponding to a gene gets a chance. That can also be helpful if multiple probe sets map to the same gene locus but actually represent different transcripts.
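
Option 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the thread: the expression values are assumed to sit in a plain dictionary of per-probe lists, and the names `pick_highest_variance_probe`, `expr`, `probe_to_gene`, and all spot and gene identifiers are hypothetical.

```python
from statistics import variance


def pick_highest_variance_probe(expr, probe_to_gene):
    """Return {gene: probe_id}, keeping the probe with the largest
    sample variance across all samples for each gene."""
    best = {}  # gene -> (variance, probe_id)
    for probe_id, values in expr.items():
        gene = probe_to_gene[probe_id]
        var = variance(values)  # sample variance; needs at least two samples
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, probe_id)
    return {gene: probe_id for gene, (_, probe_id) in best.items()}


# Hypothetical M values (log ratios) for four samples; all names are made up.
expr = {
    "spot_A1": [0.1, 0.0, -0.1, 0.1],   # nearly flat, probably uninformative
    "spot_A2": [2.0, 1.8, -1.5, 2.1],   # varies strongly across samples
    "spot_B1": [0.5, -0.4, 0.6, -0.5],  # only spot for its gene
}
probe_to_gene = {"spot_A1": "GENE_A", "spot_A2": "GENE_A", "spot_B1": "GENE_B"}

selected = pick_highest_variance_probe(expr, probe_to_gene)
print(selected)
```

The sketch does not handle missing values, which are common in spotted arrays and would need to be filtered out or imputed first.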

Figure explanation: The figure shows a set of several hundred breast cancer samples which were expected to be predominantly ER-positive but with a few ER-negative samples mixed in. The last probe set at the bottom shows the expected pattern of nice strong expression for most samples with a small subset showing very weak expression. Other probe sets show only a weak distinction between ER+ and ER- samples or don't have any discernible expression at all.

[Figure: expression of the ESR1 probe sets on the U133A array across the breast cancer samples]


Thank you. I see that you have a point. Could you also give the paper from which this figure is taken?

Reply by Sudeep, 8.4 years ago.

No problem. Unfortunately, that figure was generated by me and is not yet in a paper. But this specific issue with ER on the U133A has been described previously in PMID: 17329190.

Reply by Obi Griffith, 8.4 years ago.

For some strange reason, I knew that probe 205225_at was ESR1 (oestrogen receptor alpha 1) without even checking...

Reply by Kevin Blighe, 20 months ago.
Istvan Albert (University Park, USA) wrote, 8.4 years ago:

This question is one of those that are as old as the technologies themselves and go back a decade or more. The consensus, as far as I know, is that the right answer depends on many factors, most of which are very subjective.

My opinion is not to dwell on it too long. Pick a safe choice, such as averaging them, and carry on with the analysis. There are many far more time-consuming aspects of data analysis and interpretation that have a far larger impact on the outcome than picking one or the other of your choices.
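
For completeness, averaging duplicate spots sample by sample might look like the sketch below. It assumes the same hypothetical dictionary layout of per-probe value lists; `average_duplicates` and all spot and gene names are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(expr, probe_to_gene):
    """Return {gene: [mean value per sample]} over all probes mapped to that gene."""
    grouped = defaultdict(list)  # gene -> list of per-probe value lists
    for probe_id, values in expr.items():
        grouped[probe_to_gene[probe_id]].append(values)
    # zip(*rows) walks the samples column by column across the duplicate probes
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical M values for three samples; all names are made up.
expr = {
    "spot_A1": [1.0, 2.0, -1.0],
    "spot_A2": [3.0, 4.0, 1.0],
    "spot_B1": [0.5, -0.5, 0.0],
}
probe_to_gene = {"spot_A1": "GENE_A", "spot_A2": "GENE_A", "spot_B1": "GENE_B"}

averaged = average_duplicates(expr, probe_to_gene)
print(averaged)
```

As with any plain mean, a duplicate spot that carries only noise dilutes the signal from a working one, which is the concern raised in the other answer.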
