Question

Microarray Across Experiments

12.1 years ago

Greg Clark ▴ 20

Hi,

I'm new to microarray experiments and am trying to get a grip on them. I am using GDS596 (the well-known Su et al. 2004 PNAS data) and trying to get a single expression value for each gene. Essentially, I am looking to replicate the analysis done in Tartaglia et al., "Life on the Edge...", Trends Biochem Sci 32(5), 2007.

I have obtained the raw human *.CEL files and would like some clarification on the steps taken. My questions are below.

1.) MAS5 normalization (for background correction, via the R affy package), then take log10 of these values and average across genes and across experiments. Fine (rma or gcrma could also be used).

2.) The authors then "median scale followed by quantile normalization". So, scaling across experiments (i.e. the GSM columns) allows us to make comparisons between experiments. Fine. I don't scale row-wise as some other papers do, though (I'm not sure why you would do this).

3.) Then, quantile normalization? Why is this step taken? I had thought that this was done at the probe level. If intensities are normalized (MAS5) and corrected across experiments (median scaling), why another normalization?

I find an 'ok' correlation (Pearson's r ~0.77) with the paper's expression values after the first 2 steps, but quantile normalization then screws everything up. Is there something obvious I'm doing wrong here?
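To make the difference between steps 2 and 3 concrete, here is a minimal sketch of median scaling followed by quantile normalization. It is written in plain Python rather than R only to keep the illustration self-contained, the toy values are invented, and the function names `median_scale` and `quantile_normalize` are hypothetical; this is not the paper's code, and it breaks ties by position where production implementations usually average tied ranks.

```python
from statistics import mean, median

def median_scale(columns):
    """Rescale each array (column) so that all arrays share the same median."""
    medians = [median(col) for col in columns]
    target = mean(medians)  # common median to scale towards (one possible choice)
    return [[v * target / m for v in col] for col, m in zip(columns, medians)]

def quantile_normalize(columns):
    """Force every array to have the same distribution of values."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [mean([sc[i] for sc in sorted_cols]) for i in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference at its rank
        result.append(out)
    return result

# Toy data: each inner list is one array (one GSM column) with 4 probe sets.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
scaled = median_scale(arrays)
normalized = quantile_normalize(scaled)

for col in normalized:
    print(sorted(col))  # the same sorted values for every array
```

Median scaling lines up only one point of each distribution (the median), whereas quantile normalization makes the whole distribution identical across arrays while keeping each array's internal ranking. Because the reference distribution is an average of sorted values, the result also depends on whether the step is applied to logged or unlogged data.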

Thanks

greg

microarray r gene • 4.0k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 12.1 years ago by Greg Clark ▴ 20

One small comment. People typically take log2 of microarray data. I've rarely (if ever) seen log10. I think gcrma and rma already output log2 data. I'm not sure about mas5. So, you could try various combinations of not logging, or log2 instead of log10, to see if you get better correlation with the paper's expression values. I typically just process CEL files with gcrma and don't apply an additional median scaling and quantile normalization to that data. GCRMA already includes a quantile normalization. Summarizing to the gene level is a separate issue and will depend on which chip you have.
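As an illustration of the gene-level summarization step mentioned above, here is a minimal sketch that collapses several probe sets mapping to the same gene by taking the median per array. The probe-set IDs, gene names, values, and the function name `summarize_by_gene` are all hypothetical, and the median is only one of several reasonable choices; the real mapping depends on the chip annotation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical annotation: probe set -> gene symbol.
probe_to_gene = {"ps_1": "GENE_A", "ps_2": "GENE_A", "ps_3": "GENE_B"}

# Hypothetical log2 expression values, one entry per array.
log2_expr = {
    "ps_1": [7.0, 7.4],
    "ps_2": [9.0, 9.2],
    "ps_3": [5.0, 5.5],
}

def summarize_by_gene(expr, mapping):
    """Collapse probe sets to genes using the per-array median."""
    grouped = defaultdict(list)
    for probe, values in expr.items():
        grouped[mapping[probe]].append(values)
    # zip(*rows) walks the arrays; the median is taken across probe sets.
    return {gene: [median(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}

gene_expr = summarize_by_gene(log2_expr, probe_to_gene)
print(gene_expr)
```

A gene with a single probe set keeps its values unchanged, while a gene with several probe sets gets one value per array.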

ADD REPLY • link 12.1 years ago by Obi Griffith 20k

What is a possible solution if you apply GCRMA but only get 3 genes with a log fold change (lfc) greater than 1? Is it possible to read CEL files but apply only a log2 transformation, as done in the GEO2R analysis by NCBI GEO? Using the GEO2R approach on the exact same samples gives me all top 250 genes above lfc=1.

ADD REPLY • link 8.7 years ago by Bioinformatist Newbie ▴ 270

Please ask a separate question rather than asking a question as a comment to a post.

ADD REPLY • link 8.7 years ago by Sean Davis 26k

@Sean: Check this one: Microarray analysis of CEL files with Log-transformation instead of GCRMA or RMA

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by Bioinformatist Newbie ▴ 270

The base of the log does not affect something like a correlation or statistics related to differential expression. log2(x) and log10(x) simply differ by a constant scale factor: log2(x) ≈ 3.32193 × log10(x). The log2 and log10 distributions are therefore identical except for scaling by that constant.
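A quick numerical check of this point, as a minimal sketch with invented intensities and a hand-written `pearson` helper (a hypothetical name, not a library function):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Invented linear-scale intensities for the same probe sets on two arrays.
a = [120.0, 850.0, 40.0, 3000.0, 560.0]
b = [100.0, 900.0, 55.0, 2500.0, 610.0]

scale = math.log2(10)  # the constant relating the two bases, about 3.32193

r_log2 = pearson([math.log2(v) for v in a], [math.log2(v) for v in b])
r_log10 = pearson([math.log10(v) for v in a], [math.log10(v) for v in b])
r_linear = pearson(a, b)

print(scale, r_log2, r_log10, r_linear)
```

The two logged correlations agree up to floating-point error. The correlation on unlogged values is generally different, because taking a log is not a linear rescaling, so logged versus unlogged matters even though the choice of base does not.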

ADD REPLY • link 12.1 years ago by Sean Davis 26k

Mathematically this is correct, but log2 is convenient because many people find it easier to think about doubled values rather than powers of 10.
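A small arithmetic sketch of why log2 reads more naturally for fold changes; the two intensities are invented for illustration:

```python
import math

# Invented linear-scale intensities for one gene in two conditions.
control, treated = 200.0, 400.0

lfc_log2 = math.log2(treated) - math.log2(control)     # about 1: one doubling
lfc_log10 = math.log10(treated) - math.log10(control)  # about 0.301 for the same change

fold_change = 2 ** lfc_log2  # back on the linear scale: about 2-fold

print(lfc_log2, lfc_log10, fold_change)
```

On the log2 scale a difference of 1 means a doubling and a difference of 2 means a 4-fold change, whereas on the log10 scale the same doubling shows up as roughly 0.301.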

ADD REPLY • link 10.3 years ago by David Quigley 11k

Well, I'm just going to use RMA instead and not rely on the other publication's protocol.

ADD REPLY • link 12.1 years ago by Greg Clark ▴ 20