Question

Handling Duplicate Probe Expression Values In Spotted Cdna Microarray


11.7 years ago

Sudeep ★ 1.7k

Dear All,

I am working on a publicly available microarray dataset from GEO, generated on a custom spotted cDNA microarray, and I am interested in finding correlated genes in it. When I calculated Pearson correlations on the M values obtained after normalization (using limma), I saw that duplicate probes appear as highly correlated (correlation values from 0.60 to 0.99). My question is: how should I handle the expression values of these duplicated spots? Should I take the mean or the highest value of these probes? I have been searching on this for some time, but I couldn't find anything.

Thank you in advance.

microarray correlation • 8.3k views

updated 11.7 years ago by Obi Griffith 20k • written 11.7 years ago by Sudeep ★ 1.7k

score 15 · Answer 1 · 2012-08-27

My recommendation, based on experience with oligo arrays (e.g., Affymetrix expression arrays), is not to combine values with a simple average, median, etc. when you have more than one probe set supposedly querying the same gene. Take the example of ESR1 (estrogen receptor), a very important gene in breast cancer. On the U133A array it is represented by 9 different probe sets, only one of which works as expected (see figure below). Averaging produces a terrible result. Even the cleverly redefined custom probe sets from the Michigan group don't perform well in this case (although generally they work much better than Affy's standard probe set definitions).

What you should do probably depends on your final goal. But if that goal involves identifying differentially expressed genes between conditions, or using expression values in a clustering or classification exercise, then I suggest one of the following:

Choose the probe set/spot with the highest variance (across all samples in your study) for each gene. This is the kind of filtering you are likely to do anyway to reduce the multiple-testing problem, it is unbiased with respect to your comparison, and it avoids averaging out real signal with noise.
An even safer option (in some ways) is to leave all probe sets/spots in your analysis until the very final stage of biological interpretation. This way each probe set corresponding to a gene gets a chance. It can also be helpful when multiple probe sets map to the same gene locus but actually represent different transcripts.
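
As a rough illustration of the first option, here is a minimal Python sketch. The spot IDs, gene assignments and M values are made up for the example, and it assumes a complete matrix with no missing values (real spotted-array data often has flagged or missing spots that would need handling first). It is not the limma workflow itself, just the selection step.

```python
from statistics import variance


def pick_most_variable_probe(expression, probe_to_gene):
    """Return {gene: probe_id}, keeping the probe with the highest variance.

    expression    -- dict: probe/spot ID -> list of values, one per array
    probe_to_gene -- dict: probe/spot ID -> gene symbol
    """
    best = {}  # gene -> (variance, probe_id)
    for probe_id, values in expression.items():
        gene = probe_to_gene[probe_id]
        var = variance(values)  # sample variance across all arrays
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, probe_id)
    return {gene: probe_id for gene, (var, probe_id) in best.items()}


# Hypothetical M values (log-ratios) for three spots across four arrays.
expression = {
    "spot_1": [0.10, 0.20, 0.10, 0.20],    # nearly flat, probably uninformative
    "spot_2": [-1.50, 1.80, -1.20, 2.10],  # varies strongly between samples
    "spot_3": [0.50, -0.40, 0.60, -0.50],
}
probe_to_gene = {"spot_1": "ESR1", "spot_2": "ESR1", "spot_3": "GATA3"}

selected = pick_most_variable_probe(expression, probe_to_gene)
print(selected)  # {'ESR1': 'spot_2', 'GATA3': 'spot_3'}
```

Because the variance is computed across all samples without looking at group labels, the choice does not favor any particular comparison.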

Figure explanation: The figure shows a set of several hundred breast cancer samples which were expected to be predominantly ER-positive but with a few ER-negative samples mixed in. The last probe set at the bottom shows the expected pattern of nice strong expression for most samples with a small subset showing very weak expression. Other probe sets show only a weak distinction between ER+ and ER- samples or don't have any discernible expression at all.

[Figure: expression of the ESR1 probe sets on the U133A array across the breast cancer samples]

score 10 · Answer 2 · 2012-08-24

This question is one of those that are as old as the technologies themselves and go back a decade or more. The consensus, as far as I know, is that the right answer depends on many factors, most of which are very subjective.

My opinion is not to dwell on it too long. Pick a safe choice, such as averaging them, and carry on with the analysis. There are many far more time-consuming aspects of data analysis and interpretation that have a far larger impact on the outcome than picking one or the other of your choices.
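
For completeness, averaging duplicate spots per gene could look roughly like the Python sketch below. The spot IDs, gene names and values are hypothetical, and it assumes every spot has a value on every array.

```python
from collections import defaultdict
from statistics import mean


def average_duplicate_spots(expression, probe_to_gene):
    """Return {gene: per-array mean over all spots mapped to that gene}."""
    grouped = defaultdict(list)  # gene -> list of per-spot value lists
    for probe_id, values in expression.items():
        grouped[probe_to_gene[probe_id]].append(values)
    # zip(*rows) walks the arrays one at a time, so each mean is taken
    # over the duplicate spots within a single array.
    return {
        gene: [mean(per_array) for per_array in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical M values for two duplicate spots and one singleton, two arrays.
expression = {
    "spot_a": [1.0, 2.0],
    "spot_b": [3.0, 4.0],
    "spot_c": [0.5, -0.5],
}
probe_to_gene = {"spot_a": "GENE1", "spot_b": "GENE1", "spot_c": "GENE2"}

averaged = average_duplicate_spots(expression, probe_to_gene)
print(averaged)  # {'GENE1': [2.0, 3.0], 'GENE2': [0.5, -0.5]}
```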