Why Is Correlation In Gene Expression Usually Done In Log-Space And Not Linear Intensity-Space?
2
3
11.0 years ago
Brian Tsai ▴ 100

I'm trying to compute the correlation between two genes across multiple samples from a microarray analysis (Affy). The Pearson correlation changes depending on whether I compute it in the original intensity domain or in the log2 domain. I'm told that I should be doing this in log2 space, but I'm not sure what the reasoning is.

gene • 25k views
0

I'm not an expert in statistics, but you usually get better statistical properties after a log transformation. For example, in some cases the distribution will be closer to a normal distribution after taking the log, which is more tractable statistically.

5
11.0 years ago

You shouldn't really be doing any "downstream analysis" using original intensity data, but rather use some normalized version of the data.

You might find this useful:

In particular, look at the "Explicit error models" section where it mentions that "... a log-transform decouples a random multiplicative error (e^n) from true signal intensity...", and the related reference (Brown et al.)
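The decoupling mentioned in that quote can be illustrated with a small simulation. This is only a sketch under assumed values: the function name `simulate`, the true signal levels (100 and 10000), the noise level `sigma`, and the seeds are all made up for illustration and do not come from the referenced material. With a multiplicative error e^n, the spread of raw intensities grows with the signal, while the spread of the log2 values stays about the same.

```python
import math
import random
import statistics

def simulate(true_signal, n_reps=2000, sigma=0.2, seed=0):
    """Replicate measurements of one true signal with multiplicative error e^n,
    where n is Gaussian noise with standard deviation sigma (an assumed model)."""
    rng = random.Random(seed)
    return [true_signal * math.exp(rng.gauss(0.0, sigma)) for _ in range(n_reps)]

# Two hypothetical genes: one weakly and one strongly expressed.
low = simulate(100.0, seed=1)
high = simulate(10000.0, seed=2)

# On the raw scale the standard deviation scales with the signal level.
sd_raw_low = statistics.stdev(low)
sd_raw_high = statistics.stdev(high)

# On the log2 scale the error becomes additive, so the spread is similar
# for both genes regardless of their expression level.
sd_log_low = statistics.stdev([math.log2(x) for x in low])
sd_log_high = statistics.stdev([math.log2(x) for x in high])

print(f"raw  sd: low={sd_raw_low:.1f}  high={sd_raw_high:.1f}")
print(f"log2 sd: low={sd_log_low:.3f}  high={sd_log_high:.3f}")
```

In this toy model the raw standard deviations differ by roughly the same factor as the signals themselves, whereas the log2 standard deviations are nearly identical.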

0
11.0 years ago

Correlation is usually computed on log2 data because regardless of the normalization method (e.g. RMA), that's the scale typically used for microarray analysis. The reason for this is nicely stated in the manual to Cluster 3.0:

The results of many DNA microarray experiments are fluorescent ratios. Ratio measurements are most naturally processed in log space. Consider an experiment where you are looking at gene expression over time, and the results are relative expression levels compared to time 0. Assume at timepoint 1, a gene is unchanged, at timepoint 2 it is up 2-fold and at timepoint three is down 2-fold relative to time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most applications, you want to think of 2-fold up and 2-fold down as being the same magnitude of change, but in an opposite direction. In raw ratio space, however, the difference between timepoint 1 and 2 is +1.0, while between timepoint 1 and 3 is -0.5. Thus mathematical operations that use the difference between values would think that the 2-fold up change was twice as significant as the 2-fold down change. Usually, you do not want this. In log space (we use log base 2 for simplicity) the data points become 0, 1.0, -1.0. With these values, 2-fold up and 2-fold down are symmetric about 0. For most applications, we recommend you work in log space.
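The arithmetic in the quoted passage can be checked in a few lines of Python. The variable names are illustrative; the ratio values are the ones given in the manual's example.

```python
import math

# Raw expression ratios relative to time 0: unchanged, 2-fold up, 2-fold down.
ratios = [1.0, 2.0, 0.5]

# Differences from the unchanged timepoint on the raw ratio scale:
# the 2-fold increase looks twice as large as the 2-fold decrease.
raw_diffs = [r - ratios[0] for r in ratios[1:]]

# The same values on the log2 scale are symmetric about 0.
log_ratios = [math.log2(r) for r in ratios]

print(raw_diffs)   # [1.0, -0.5]
print(log_ratios)  # [0.0, 1.0, -1.0]
```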

2

Although if the poster is using Affy arrays then there's no 'ratio' of two samples.

0

It's true that one-color arrays don't present data as sample/reference. However, the larger point still holds for Affy data: log2-transforming the intensities makes equal fold changes equal in size (50 vs 100 and 100 vs 200 are both a fold change of 2, which is a difference of 1 on the log2 scale, whereas the raw differences are 50 and 100).
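To tie this back to the original question, here is a minimal sketch of how the Pearson correlation between two genes can differ between the raw and log2 scales. The intensities below are invented for illustration (they are not from any real Affy dataset), and `pearson` is a small helper written here rather than a library function. The last sample has a very high intensity for both genes, which dominates the raw-scale correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical intensities for two genes across five samples.
# In the first four samples the genes move in opposite directions;
# the fifth sample is very bright for both.
gene_a = [100.0, 200.0, 400.0, 800.0, 12800.0]
gene_b = [800.0, 400.0, 200.0, 100.0, 12800.0]

r_raw = pearson(gene_a, gene_b)
r_log = pearson([math.log2(v) for v in gene_a],
                [math.log2(v) for v in gene_b])

print(f"Pearson r on raw intensities: {r_raw:.3f}")
print(f"Pearson r on log2 intensities: {r_log:.3f}")
```

On the raw scale the single bright sample pulls the correlation close to 1, while on the log2 scale the opposing trend in the other four samples carries more weight and the correlation is noticeably lower.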