Handling Duplicate Probe Expression Values in Spotted cDNA Microarray
11.7 years ago
Sudeep ★ 1.7k

Dear All,

I am working on a publicly available microarray dataset from GEO, and I am interested in finding correlated genes in it. The dataset comes from a custom spotted cDNA microarray. When I calculated Pearson correlations on the M values obtained after normalization (using limma), I saw that duplicate probes appear as highly correlated (correlation values from 0.60 to 0.99). My question is: how should I handle the expression values of these duplicated spots? Should I take the mean or the highest value of these probes? I have been searching for an answer for some time but couldn't find anything.

Thank you in advance.

microarray correlation • 8.3k views
11.6 years ago

My recommendation based on experience with oligo arrays (e.g., Affymetrix expression arrays) is to not combine with a simple average/median/etc when you have more than one probe set supposedly querying the same gene. Take the example of ESR1 (Estrogen receptor), a very important gene in breast cancer. On the U133A array this is represented by 9 different probe sets, only one of which works as expected (see figure below). Averaging produces a terrible result. Even the cleverly re-defined custom probe sets from the Michigan group don't perform well in this case (although generally they work much better than Affy's standard probe set definitions).

What you should do probably depends on your final goal. But if that goal involves identifying differentially expressed genes between conditions, or using expression values for clustering or classification, then I suggest:

  1. Choose the probe set/spot with the highest variance (across all samples in your study) for each gene. This is the kind of filtering you are likely to do anyway to reduce the multiple-testing problem, it is unbiased with respect to your comparison, and it avoids averaging real signal with noise.
  2. An even safer option (in some ways) is to just leave all probe sets/spots in your analysis until the very final stage of biological interpretation. This way each probe set corresponding to a gene gets a chance. That can also be helpful if multiple probe sets map to the same gene locus but actually represent different transcripts.
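As a rough illustration of option 1, here is a minimal Python sketch using only the standard library. The function name `pick_highest_variance_probe`, the spot and gene identifiers, and the M values are all made up for illustration; in practice the values would come from your normalized expression matrix and the array annotation.

```python
from statistics import variance

def pick_highest_variance_probe(expr, probe_to_gene):
    """For each gene, keep the probe/spot whose values have the largest
    sample variance across all samples.

    expr: dict mapping probe id -> list of expression values (one per sample)
    probe_to_gene: dict mapping probe id -> gene identifier
    Returns a dict mapping gene -> selected probe id.
    """
    best = {}  # gene -> (variance, probe id) of the best probe seen so far
    for probe, values in expr.items():
        gene = probe_to_gene[probe]
        var = variance(values)  # needs at least two samples
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, probe)
    return {gene: probe for gene, (_, probe) in best.items()}

# Toy M values for four samples (hypothetical data, not from any real array)
expr = {
    "spot_1": [0.1, 0.0, -0.1, 0.1],   # nearly flat, probably uninformative
    "spot_2": [2.0, -1.5, 1.8, -2.1],  # varies strongly across samples
    "spot_3": [0.5, 0.4, 0.6, 0.5],
}
probe_to_gene = {"spot_1": "GENE_A", "spot_2": "GENE_A", "spot_3": "GENE_B"}

selected = pick_highest_variance_probe(expr, probe_to_gene)
print(selected)
```

Because the variance is computed across all samples without reference to group labels, the selection does not favour any particular comparison.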

Figure explanation: The figure shows a set of several hundred breast cancer samples which were expected to be predominantly ER-positive but with a few ER-negative samples mixed in. The last probe set at the bottom shows the expected pattern of nice strong expression for most samples with a small subset showing very weak expression. Other probe sets show only a weak distinction between ER+ and ER- samples or don't have any discernible expression at all.

[Figure: expression of the ESR1 probe sets on the U133A array across the breast cancer samples; image not available]


Thank you. I see that you have a point. Could you also point me to the paper this figure is taken from?


No problem. Unfortunately, I generated that figure myself and it is not yet in a paper. But this specific issue with ER on the U133A has been described previously in PMID: 17329190.


For some strange reason, I knew that probe 205225_at was ESR1 (oestrogen receptor alpha 1) without even checking...

11.7 years ago

This question is as old as the technologies themselves and goes back a decade or more. The consensus, as far as I know, is that the right answer depends on many factors, most of which are very subjective.

My opinion is not to dwell on it too long. Pick a safe choice, such as averaging them, and carry on with the analysis. There are many far more time-consuming aspects of data analysis and interpretation that have a far larger impact on the outcome than picking one or the other of your choices.
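For completeness, a minimal sketch of the averaging approach in plain Python. The function name `average_duplicates`, the identifiers, and the values are hypothetical and only meant to show the per-sample averaging of duplicate spots.

```python
from collections import defaultdict
from statistics import mean

def average_duplicates(expr, probe_to_gene):
    """Collapse duplicate spots by taking, for each sample, the mean of
    all spots that map to the same gene.

    expr: dict mapping probe id -> list of expression values (one per sample)
    probe_to_gene: dict mapping probe id -> gene identifier
    Returns a dict mapping gene -> list of averaged values (one per sample).
    """
    grouped = defaultdict(list)  # gene -> list of per-probe value lists
    for probe, values in expr.items():
        grouped[probe_to_gene[probe]].append(values)
    # zip(*rows) walks the samples column by column across the duplicates
    return {
        gene: [mean(sample_values) for sample_values in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Toy M values for three samples (hypothetical data)
expr = {
    "spot_1": [1.0, 2.0, -1.0],
    "spot_2": [3.0, 0.0, -3.0],
    "spot_3": [0.5, 0.5, 0.5],
}
probe_to_gene = {"spot_1": "GENE_A", "spot_2": "GENE_A", "spot_3": "GENE_B"}

averaged = average_duplicates(expr, probe_to_gene)
print(averaged)
```

If you use limma itself, I believe its `avereps` function does this kind of averaging over replicate spots, but check the limma documentation before relying on it.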
