Question

Why Is Correlation In Gene Expression Usually Done In Log-Space And Not Linear Intensity-Space?


12.5 years ago

Brian Tsai ▴ 100

I'm trying to compute the correlation between two genes across multiple samples from a microarray analysis (Affy). The Pearson correlation changes depending on whether I compute it on the original intensity scale or on the log2 scale. I'm told that I should be doing this in log2-space, but I'm not sure what the reasoning is.

gene • 26k views

updated 12.5 years ago by David Quigley 11k • written 12.5 years ago by Brian Tsai ▴ 100


I'm not an expert on statistics, but you usually get better statistical properties after a log transformation. For example, in some cases the distribution is closer to a normal distribution after taking the log, which is more tractable statistically.
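To make that concrete, here is a minimal sketch using simulated data rather than real Affy intensities. It assumes two hypothetical genes (`gene_a`, `gene_b`) whose log2 intensities are jointly normal with a correlation of 0.8; the seed, sample size, and baseline level of 8 are arbitrary choices for illustration. It compares skewness and the Pearson correlation on both scales.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000   # number of simulated samples (arbitrary)
rho = 0.8    # assumed correlation between the two genes on the log2 scale

# Draw two correlated standard-normal variables (these live on the log2 scale).
z1 = rng.normal(size=n)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Hypothetical raw intensities: log-normal, centred around 2**8.
gene_a = 2.0 ** (8 + z1)
gene_b = 2.0 ** (8 + z2)


def skewness(x):
    """Sample skewness: third central moment over variance**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# Raw intensities are right-skewed; log2 values are roughly symmetric.
skew_raw = skewness(gene_a)
skew_log = skewness(np.log2(gene_a))

# Pearson correlation on each scale.
r_raw = np.corrcoef(gene_a, gene_b)[0, 1]
r_log = np.corrcoef(np.log2(gene_a), np.log2(gene_b))[0, 1]

print(f"skewness raw={skew_raw:.2f}  log2={skew_log:.2f}")
print(f"Pearson  raw={r_raw:.3f}  log2={r_log:.3f}")
```

On the raw scale a few very large intensities dominate the sums that go into Pearson's r, so the raw-scale value does not match the log2-scale value even though the underlying relationship is the same.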

12.5 years ago by Vitis ★ 2.5k

score 5 · Answer 1 · 2011-10-18

You shouldn't really be doing any "downstream analysis" on the original intensity data; use some normalized version of the data instead.

You might find this useful:

There is no silver bullet -- a guide to Low-level Data Transforms and Normalization Methods for Microarray Data (PDF)

In particular, look at the "Explicit error models" section where it mentions that "... a log-transform decouples a random multiplicative error (e^n) from true signal intensity...", and the related reference (Brown et al.)
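The decoupling can be illustrated with a small simulation. This is only a sketch under an assumed model, observed = true × e^n with n normally distributed; the two signal levels and the value of `sigma` are made up for illustration and are not taken from the guide or from Brown et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true signal for a weakly and a strongly expressed gene.
true_signal = np.array([100.0, 10_000.0])

# Assumed multiplicative error model: observed = true * e^n, n ~ Normal(0, sigma).
sigma = 0.2
n = rng.normal(0.0, sigma, size=(5000, 2))
observed = true_signal * np.exp(n)

# On the raw scale the spread of the error grows with the signal...
raw_sd = observed.std(axis=0)

# ...while on the log2 scale the error is additive, log2(true) + n / ln(2),
# so its spread no longer depends on the signal level.
log_sd = np.log2(observed).std(axis=0)

print("raw-scale SD per gene: ", raw_sd)
print("log2-scale SD per gene:", log_sd)
```

Under this model the raw-scale standard deviation differs by roughly the same factor as the signal itself, whereas the log2-scale standard deviations are about equal for both genes.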

Ram · Answer 2 · 2011-10-18

Correlation is usually computed on log2 data because regardless of the normalization method (e.g. RMA), that's the scale typically used for microarray analysis. The reason for this is nicely stated in the manual to Cluster 3.0:

The results of many DNA microarray experiments are fluorescent ratios. Ratio measurements are most naturally processed in log space. Consider an experiment where you are looking at gene expression over time, and the results are relative expression levels compared to time 0. Assume at timepoint 1, a gene is unchanged, at timepoint 2 it is up 2-fold and at timepoint three is down 2-fold relative to time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most applications, you want to think of 2-fold up and 2-fold down as being the same magnitude of change, but in an opposite direction. In raw ratio space, however, the difference between timepoint 1 and 2 is +1.0, while between timepoint 1 and 3 is -0.5. Thus mathematical operations that use the difference between values would think that the 2-fold up change was twice as significant as the 2-fold down change. Usually, you do not want this. In log space (we use log base 2 for simplicity) the data points become 0, 1.0, -1.0. With these values, 2-fold up and 2-fold down are symmetric about 0. For most applications, we recommend you work in log space.
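The arithmetic in that passage is easy to check. A minimal sketch using the three ratio values from the quote (the variable names are just for illustration):

```python
import numpy as np

# Ratios relative to time 0: unchanged, 2-fold up, 2-fold down.
ratios = np.array([1.0, 2.0, 0.5])

# Differences from timepoint 1 on the raw ratio scale: +1.0 and -0.5.
raw_steps = ratios[1:] - ratios[0]

# The same differences on the log2 scale: +1.0 and -1.0.
log_ratios = np.log2(ratios)
log_steps = log_ratios[1:] - log_ratios[0]

print("raw differences: ", raw_steps)
print("log2 differences:", log_steps)
```

Anything built on differences between values, including the deviations from the mean that Pearson correlation uses, weights up- and down-regulation equally only on the log scale.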