Question: Microarray Across Experiments
Asked 8.3 years ago by Greg Clark:


I'm new to microarray experiments and am trying to get a grip on the analysis. I am using GDS596 (the well-known Su et al. 2004 PNAS data) and trying to get a single expression value for each gene. Essentially, I am looking to replicate the analysis done in Tartaglia et al., "Life on the Edge...", Trends Biochem Sci 32(5), 2007.

I have obtained the raw human *.CEL files and would like some clarification on the steps taken. My questions are below.

1.) MAS5 normalization (for background correction, via the R affy package), then take log10 of these values and average across genes and across experiments. Fine (rma and gcrma could also be used).

2.) The authors then "median scale followed by quantile normalization". Scaling across experiments (i.e. the GSM columns) allows us to make comparisons between experiments. Fine. I don't scale row-wise as some other papers do, though (I'm not sure why you would do this).

3.) Then quantile normalization? Why is this step taken? I had thought that this was done at the probe level. If intensities are normalized (MAS5) and corrected across experiments (median scaling), why another normalization?
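
For reference, steps 2 and 3 can be sketched in a few lines. This is only a minimal pure-Python illustration under stated assumptions, not the procedure from the paper: the function names `median_scale` and `quantile_normalize` are hypothetical, the toy intensities are made up, median scaling is assumed to be multiplicative on unlogged values with the median of the per-array medians as the target, and ties are broken by position rather than averaged.

```python
from statistics import median


def median_scale(columns, target=None):
    """Rescale each array (column) so that its median equals a common target."""
    medians = [median(col) for col in columns]
    if target is None:
        # Assumption: use the median of the per-array medians as the target.
        target = median(medians)
    return [[v * target / m for v in col] for col, m in zip(columns, medians)]


def quantile_normalize(columns):
    """Give every array (column) the same empirical distribution."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    result = []
    for col in columns:
        # Indices of this array's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        result.append(out)
    return result


# Toy data: three arrays (GSM columns), four probesets each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
scaled = median_scale(arrays)
normalized = quantile_normalize(scaled)
```

After the second step every array holds exactly the same set of values (and hence the same median); only the ranking of probesets within each array is carried over from the input.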

I find 'ok' correlation (Pearson's r ~0.77) with the paper's expression values after the first 2 steps, but then quantile normalization screws everything up. Is there anything obvious that I'm doing wrong here?



Tags: gene, R, microarray

One small comment: people typically take log2 of microarray data. I've rarely (if ever) seen log10. I think gcrma and rma already output log2 data. I'm not sure about mas5. So, you could try various combinations of not logging, or log2 instead of log10, to see if you get better correlation with the paper's expression values. I typically just process CEL files with gcrma and don't do an additional median scaling and quantile normalization of that data, since GCRMA already includes a quantile normalization. Summarizing to the gene level is a separate issue and will depend on which chip you have.

written 8.3 years ago by Obi Griffith

What is a possible solution if you apply GCRMA but only get 3 genes with lfc greater than 1? Is it possible to read CEL files but apply only a log2 transformation, as used by the GEO2R analysis at NCBI GEO? Using the GEO2R approach on the exact same samples gives me all top 250 genes above lfc=1.

written 4.9 years ago by Bioinformatist Newbie

Please ask a separate question rather than asking a question as a comment to a post.

written 4.9 years ago by Sean Davis

@Sean: Check this one: C: Microarray analysis of CEL files with Log-transformation instead of GCRMA or RMA

modified 4.9 years ago • written 4.9 years ago by Bioinformatist Newbie

The base of the log does not affect a correlation or statistics related to differential expression. log2(x) and log10(x) differ only by a constant scale factor: log2(x) = 3.32193 × log10(x). The log2 and log10 distributions are therefore identical up to that constant.
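
A quick way to convince yourself of this is a small numerical check. The sketch below uses only the Python standard library; the `pearson` helper and the two intensity vectors are made up for illustration and do not come from GDS596.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical unlogged intensities for five probesets on two arrays.
intensities_a = [120.0, 850.0, 40.0, 2300.0, 610.0]
intensities_b = [100.0, 900.0, 55.0, 2000.0, 700.0]

log2_a = [math.log2(v) for v in intensities_a]
log2_b = [math.log2(v) for v in intensities_b]
log10_a = [math.log10(v) for v in intensities_a]
log10_b = [math.log10(v) for v in intensities_b]

scale = math.log2(10)  # the constant relating the two bases, about 3.32193

r_log2 = pearson(log2_a, log2_b)
r_log10 = pearson(log10_a, log10_b)
r_raw = pearson(intensities_a, intensities_b)

print(f"scale = {scale:.5f}")
print(f"r (log2)  = {r_log2:.6f}")
print(f"r (log10) = {r_log10:.6f}")
print(f"r (raw)   = {r_raw:.6f}")
```

The two logged correlations agree to floating-point precision, while the correlation on unlogged values is generally different. So the choice of base should not matter when comparing against published values, but whether the values were logged at all can.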

written 8.3 years ago by Sean Davis

Mathematically this is correct, but log2 is convenient because many people find it easier to think about doubled values rather than powers of 10.
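
To make the convenience concrete, here is a tiny sketch (the helper name `log2_fold_change` and the intensities are hypothetical): on the log2 scale a doubling is exactly +1 and a halving is exactly -1, whereas on the log10 scale a doubling is roughly 0.301.

```python
import math


def log2_fold_change(treated, control):
    """Log2 ratio of two unlogged expression values."""
    return math.log2(treated / control)


doubled = log2_fold_change(200.0, 100.0)  # expression doubled
halved = log2_fold_change(50.0, 100.0)    # expression halved
doubled_log10 = math.log10(200.0 / 100.0)  # same doubling on the log10 scale

print(doubled, halved, round(doubled_log10, 5))
```

This is also why a threshold such as "lfc greater than 1" is usually read as "at least a two-fold change".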

written 6.5 years ago by David Quigley

Well, I'm just going to use RMA instead, rather than rely on the other publication's protocol.

written 8.3 years ago by Greg Clark