Question

gene co-expression on microarray assembled from GDS datasets is influenced by strong log2 negative outliers

9.4 years ago

grokaine ▴ 40

I need to compute gene co-expression for a compendium of GEO microarrays. I downloaded a number of GDS datasets corresponding to two GPL platforms and merged them into a single gene expression table. After log2-transforming the values I obtained a lot of negative values. These come from small expression values (between 0 and 1), which the log2 transformation turns into large negative outliers. These outliers influence any type of co-expression measurement. The GDS datasets are supposed to be both background corrected and normalized, but I also performed quantile normalization to re-align the probe distributions among datasets. I still have too many negative values, though.

How would you recommend I proceed?

1. Download the raw .CEL files and perform unitary background correction/normalization? I have seen people say that this improves the overall quality, but I am not convinced. First, these operations are mostly performed to eliminate consistent noise due to specific experimental conditions. Second, negative values are already present in the GDS datasets after all the statistical proofing, so what guarantees that I will not end up in the same situation, especially since I will use many different experiments?
2. Add 1.0 to all expression values before log2-transforming them. This is my favored solution.
3. Do not use any log2 transformation (why is it used anyway?). However, this would make the outliers even stronger.
4. ???
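To make the second option concrete, here is a minimal sketch of what adding a pseudocount does compared with a plain log2 transform. The intensity values and the function names (`log2_plain`, `log2_pseudocount`) are made up for illustration and do not come from any GDS dataset.

```python
import math

# Hypothetical raw intensities for one probe across arrays (illustrative only).
# Values between 0 and 1 are the ones that become negative after log2.
intensities = [0.05, 0.4, 1.0, 8.0, 250.0, 4000.0]


def log2_plain(values):
    """Plain log2 transform; inputs below 1 give negative outputs."""
    return [math.log2(v) for v in values]


def log2_pseudocount(values, pseudocount=1.0):
    """log2(x + pseudocount); with pseudocount=1 the output is never negative for x >= 0."""
    return [math.log2(v + pseudocount) for v in values]


plain = log2_plain(intensities)
shifted = log2_pseudocount(intensities)

for raw, p, s in zip(intensities, plain, shifted):
    print(f"{raw:>8.2f}  log2={p:>7.3f}  log2(x+1)={s:>7.3f}")
```

The shift removes the negative values but also compresses the differences among the lowest intensities, so it changes the low end of the distribution rather than leaving it untouched.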

GEO co-expression normalization Microarray • 2.9k views

updated 2.2 years ago by Ram 43k • written 9.4 years ago by grokaine ▴ 40

Ram · Answer 1 · 2014-11-25

There is a correlation (in the qualitative sense, not in the quantitative sense) between low expression values on an array and variance. This is irrespective of the presence or absence of negative values, so I wouldn't focus on the negative values. Log transformation is used to bring the expression measures to a more bell-shaped distribution and to make the variance across expression values more similar.

I would suggest getting the .CEL files and normalizing with RMA or frozen RMA. I would not manipulate the output in an ad hoc manner without good evidence to do so; there is over a decade of experience with Affy microarrays that you would potentially be invalidating with ad hoc adjustments.

Finally, since you are interested in correlations, you can use variance filters on the features to remove features that show little or no variance since these are unlikely to show strong correlations. This will functionally remove the lowest expressed features as well.
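A rough sketch of the variance filter described above, followed by pairwise Pearson correlation on the features that survive. The gene names, expression values, and the `min_variance` cutoff are hypothetical; a real cutoff would have to be chosen from the data.

```python
import math
from statistics import mean, variance

# Hypothetical log2 expression table: gene -> values across the same five samples.
expression = {
    "geneA": [7.1, 8.3, 9.0, 6.5, 8.8],
    "geneB": [5.0, 6.1, 7.2, 4.6, 6.9],
    "geneC": [2.01, 2.00, 2.02, 1.99, 2.00],  # nearly flat, low expression
    "geneD": [9.5, 7.9, 7.0, 10.1, 7.4],
}


def variance_filter(table, min_variance):
    """Keep only genes whose sample variance reaches the cutoff."""
    return {gene: vals for gene, vals in table.items() if variance(vals) >= min_variance}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# min_variance=0.1 is an arbitrary illustrative cutoff.
kept = variance_filter(expression, min_variance=0.1)
genes = sorted(kept)
correlations = {
    (g1, g2): pearson(kept[g1], kept[g2])
    for i, g1 in enumerate(genes)
    for g2 in genes[i + 1:]
}

for pair, r in correlations.items():
    print(pair, round(r, 3))
```

In this toy table the nearly flat gene is dropped before any correlation is computed, which is the effect the filter is meant to have on the lowest expressed features.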

score 0 · Answer 2 · 2014-11-25

It is best to process from the .CEL files.

With the affy package, processing is easy and quick: detect signals up to a threshold, then transform to the log scale.

Negative values mean that the signals are less than one; these would be filtered out when you correct the .CEL files.

Use the log2 scale; otherwise there would be much more variance during comparison.