Question: How To Transform Microarray Data To Adjust For Batch Effects
8.7 years ago, David Quigley (San Francisco) wrote:

I've downloaded someone else's microarray data (Affymetrix HG-U133 Plus 2.0, normalized with GCRMA) and noticed that many unexpected genes were differentially expressed by patient sex (about 30 males and 30 females). Although a few genes (e.g. the Y-chromosome gene EIF1AY) will show obvious sex linkage in any human sample, in my experience such effects are not usually so strong or pervasive. I checked the headers in the CEL files and found that processing date is strongly confounded with sex: files processed in years one and two were overwhelmingly male, while those processed in year three were all female. I concluded that the effect is due to technical variation, or at least that it cannot be distinguished from such bias.

Many tools, such as SAM, allow you to specify batches. However, I wish to do downstream analysis using my own methods. What is the best approach to transform the data set to reduce the batch effect? I am resigned to losing any ability to detect true sex-specific gene expression. If I were only performing linear modeling, I could include the batch as a factor in my model. However, I'd also like to (for example) analyze correlation using Spearman's rank correlation, where I know of no obvious way to account for batch.
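For the rank-correlation case, one simple option is to subtract each batch's mean from every gene before ranking, which is what including batch as a factor amounts to in a linear model. The sketch below is only an illustration in plain Python: the toy values, the batch labels, and the helper names (`center_by_batch`, `ranks`, `spearman`) are all made up for the example, and it adjusts location only. Because sex and batch are confounded here, this also removes any real sex effect.

```python
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from the values in that batch (one gene)."""
    batch_means = {b: mean(v for v, bb in zip(values, batches) if bb == b)
                   for b in set(batches)}
    return [v - batch_means[b] for v, b in zip(values, batches)]

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: two genes, six arrays, two processing batches.
# Within each batch the genes move in opposite directions, but a shared
# batch shift makes them look positively correlated overall.
batches = ["y1", "y1", "y1", "y3", "y3", "y3"]
gene_a = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
gene_b = [3.0, 2.0, 1.0, 13.0, 12.0, 11.0]

raw_rho = spearman(gene_a, gene_b)
adj_rho = spearman(center_by_batch(gene_a, batches),
                   center_by_batch(gene_b, batches))
print(raw_rho, adj_rho)
```

In this toy case the batch shift alone produces a positive raw correlation, and centering recovers the negative within-batch relationship.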

A quick literature search turned up Johnson et al., Biostatistics 2007, "Adjusting batch effects in microarray expression data using empirical Bayes methods", which in turn references Benito et al., Bioinformatics 2003, "Adjustment of systematic microarray data biases". Before I dive in any further, would anyone with expertise in this area care to comment on best practices?

data modeling microarray • 4.2k views
modified 8.5 years ago by Daniel Swan • written 8.7 years ago by David Quigley
8.7 years ago, Daniel Swan (Aberdeen, UK) wrote:

I have always used ComBat.R (from the Johnson Biostatistics paper you mention) to do batch correction on datasets. It's performed very well on our datasets with marked batch variation. I can't say it's best practice, but I can certainly recommend it.
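To give a rough idea of what this kind of correction does, here is a simplified location-and-scale adjustment for a single gene. It is not ComBat: ComBat additionally shrinks the per-batch estimates with empirical Bayes across genes, which this sketch leaves out entirely. The function name `location_scale_adjust` and the toy values are hypothetical, and the sketch assumes every batch has at least two samples and non-zero variance.

```python
from statistics import mean, stdev

def location_scale_adjust(values, batches):
    """Rescale one gene so every batch has the grand mean and the pooled SD.

    Simplified illustration only: no empirical Bayes shrinkage and no
    covariates. Assumes each batch has >= 2 samples and non-zero variance.
    """
    grand_mean = mean(values)

    # Group the values by batch label.
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)

    # Per-batch location (mean) and scale (sample standard deviation).
    stats = {b: (mean(g), stdev(g)) for b, g in groups.items()}

    # Pooled within-batch standard deviation.
    df = sum(len(g) - 1 for g in groups.values())
    pooled_var = sum((len(g) - 1) * stats[b][1] ** 2
                     for b, g in groups.items()) / df
    pooled_sd = pooled_var ** 0.5

    # Standardize within batch, then map back to the common scale.
    return [grand_mean + pooled_sd * (v - stats[b][0]) / stats[b][1]
            for v, b in zip(values, batches)]

# Hypothetical toy data: the second batch is shifted and more variable.
batches = ["y1", "y1", "y1", "y3", "y3", "y3"]
gene = [1.0, 2.0, 3.0, 10.0, 14.0, 18.0]

adjusted = location_scale_adjust(gene, batches)
print(adjusted)
```

After the adjustment both batches share the same mean and spread for this gene, so any sex difference that coincides with batch is removed along with the technical effect.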
