Question: An idea to identify batch effects
5.8 years ago by
mjarosz (0) wrote:

Hi All,

I would like to ask you to comment on my idea to identify batch effects in a set of Affymetrix arrays coming from different studies:

I am thinking about discovering batches by clustering the Affymetrix control probes using k-means or a self-organizing map (SOM). However, I am afraid that the differences between batches might be so small that the clustering output will not correspond to the real batches. What do you think?
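
A minimal sketch of this idea in base R, on simulated data only. The matrix `ctrl`, the three batch labels, and the size of the per-batch offset are all made up for illustration; real values would come from the control probesets of the normalized arrays.

```r
# Simulated stand-in for control-probe intensities: 62 probes x 30 arrays.
set.seed(1)
n_probes <- 62
batch <- rep(c("A", "B", "C"), each = 10)        # hypothetical known batches
shift <- c(A = 0, B = 0.5, C = 1)[batch]         # hypothetical per-batch offset
ctrl <- matrix(rnorm(n_probes * length(batch), mean = 8, sd = 0.3),
               nrow = n_probes)
ctrl <- sweep(ctrl, 2, shift, "+")               # add the offset to each array
colnames(ctrl) <- paste0("array", seq_along(batch))

# kmeans() clusters rows, so transpose to cluster arrays rather than probes.
km <- kmeans(t(ctrl), centers = 3, nstart = 25)

# Cross-tabulate clusters against the known batches to judge the agreement.
agreement <- table(cluster = km$cluster, batch = batch)
print(agreement)
```

Shrinking the simulated offset towards zero is one way to explore the concern above: at some point the clusters stop lining up with the batches.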

batch effect microarray • 3.2k views
modified 5.8 years ago by Ann (2.3k) • written 5.8 years ago by mjarosz (0)
5.8 years ago by
Irsan (7.1k) wrote:

Perform an analysis of variance (ANOVA) of all relevant sample attributes (batch included, of course, along with e.g. diseaseState, driverMutation, sex, tumorSize, histology, ...) on the normalized log expression values. This shows you the proportional influence of batch on the expression estimates. So if you have your expression matrix (in R/Bioconductor: yourExpressionMatrix <- exprs(yourExpressionSet)), melt the matrix (e.g. with the melt() function from the reshape package in R) so that sample, probe and expression estimate become columns, and add the sample attributes as further columns. Then perform the ANOVA (with aov() from base R).
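
The steps above can be sketched roughly as follows, on simulated data. The sample table, the attribute values and the batch shifts are invented for illustration, and `as.data.frame(as.table(...))` from base R is used here in place of `melt()` only to avoid an extra package dependency.

```r
set.seed(2)
sample_ids <- paste0("s", 1:12)

# Hypothetical sample attributes: 3 batches, tumor/normal balanced within each.
samples <- data.frame(
  sample = sample_ids,
  batch = factor(rep(c("b1", "b2", "b3"), each = 4)),
  diseaseState = factor(rep(c("normal", "tumor"), times = 6)),
  stringsAsFactors = FALSE
)

# Simulated log expression matrix (probes x samples) with an added batch shift.
n_probes <- 50
expr <- matrix(rnorm(n_probes * 12, mean = 8), nrow = n_probes,
               dimnames = list(probe = paste0("p", seq_len(n_probes)),
                               sample = sample_ids))
batch_shift <- c(b1 = 0, b2 = 0.8, b3 = -0.5)[as.character(samples$batch)]
expr <- sweep(expr, 2, batch_shift, "+")

# Long format: one row per probe/sample pair, then attach sample attributes.
long <- as.data.frame(as.table(expr), responseName = "expression")
long$sample <- as.character(long$sample)
long <- merge(long, samples, by = "sample")

# ANOVA, then each term's share of the total sum of squares.
fit <- aov(expression ~ probe + batch + diseaseState, data = long)
tab <- summary(fit)[[1]]
prop <- tab[["Sum Sq"]] / sum(tab[["Sum Sq"]])
names(prop) <- trimws(rownames(tab))
print(round(prop, 3))
```

The `prop` vector gives the proportional contribution of each term. Note that this sketch models one batch shift shared by all probes; whether that is adequate for real data is an assumption to check.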

modified 5.8 years ago • written 5.8 years ago by Irsan (7.1k)

Irsan, thank you very much. I will give it a try.

written 5.8 years ago by mjarosz (0)

Hi Irsan,

I would like to ask you two questions about the details of preparing data for ANOVA. I have decided that all samples from the same study scanned on the same day make up a batch.
(1) There are some batches which contain only one or two samples. In my opinion, these batches should be discarded from further analysis. However, this way I am losing valuable data (49 out of 397 samples). How do you resolve such issues in your analyses?
(2) There are some batches which contain only tumors (or only controls), not both. What would you suggest I do with them?

Best regards,

written 5.7 years ago by mjarosz (0)

Unless I misunderstand your questions, they are not related to preparing the data for ANOVA. But still here are my answers:

(1) Instead of discarding samples, include batch as a covariate in your differential expression workflow: design <- model.matrix(~ Batch + isTumor). This way, the resulting fold changes are differences in gene expression between tumor and normal samples, corrected for the batch effect.

(2) Keep all your samples and if you are worried about batch effect use batch as a covariate in your design matrix as described above.
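
As a rough illustration of such a design matrix, here is a sketch with a made-up sample annotation that includes a single-sample batch and a tumor-only batch. The data frame `pheno`, the simulated gene `y` and all effect sizes are hypothetical. Base `lm.fit()` is used only to keep the example free of Bioconductor dependencies; in a limma workflow the same `design` would be passed to `lmFit()`.

```r
# Hypothetical annotation: b3 has a single sample, b4 contains only tumors.
pheno <- data.frame(
  Batch   = factor(c("b1", "b1", "b1", "b1", "b2", "b2", "b2", "b3", "b4", "b4")),
  isTumor = factor(c("no", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"))
)

# Batch enters as a covariate next to the factor of interest.
design <- model.matrix(~ Batch + isTumor, data = pheno)
print(colnames(design))

# One simulated gene: tumor effect of 1.5 plus batch offsets plus noise.
set.seed(3)
batch_offset <- c(b1 = 0, b2 = 1, b3 = -1, b4 = 0.5)[as.character(pheno$Batch)]
y <- 8 + 1.5 * (pheno$isTumor == "yes") + batch_offset +
  rnorm(nrow(pheno), sd = 0.2)

# Least-squares fit; the isTumoryes coefficient is the batch-adjusted difference.
fit <- lm.fit(design, y)
print(round(fit$coefficients, 2))
```

In this additive model the tumor coefficient is estimated from the batches that contain both conditions (b1 and b2 here); the tumor-only and single-sample batches stay in the data but do not inform that coefficient directly.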

written 5.7 years ago by Irsan (7.1k)


In fact, I was thinking about preparing data for discovering batch effects using the aov function as you suggested. What shall I do for (1) and (2) in this case?

written 5.7 years ago by mjarosz (0)
5.8 years ago by
Concord NC USA
Ann (2.3k) wrote:

Are you talking about using the mismatch probes because they measure background?

That sounds like a neat idea. Could be worth a try.

However you should take a look at how people are using hierarchical clustering and PCA to discover bias in data. Look at the limma vignette in Bioconductor and also Bioconductor Case Studies (book by Gentleman and friends).

I use R/Bioconductor methods "hclust" and "plotMDS" to find out if samples got switched or if there are batch effects. Then, if there *are* batch effects, I try to account for them using linear modeling in limma (microarray) or edgeR (RNA-Seq). But there are many methods for doing this -- those just happen to be the ones I am most familiar with.
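
A small sketch of this kind of check on simulated data. The expression matrix, the two batches and the shift are invented for illustration, and `cmdscale()` from base R is used as a stand-in for limma's `plotMDS()`; it is not the same computation.

```r
set.seed(4)
batch <- rep(c("A", "B"), each = 6)              # hypothetical batches
expr <- matrix(rnorm(200 * 12, mean = 8), nrow = 200)
expr[, batch == "B"] <- expr[, batch == "B"] + 1.5   # hypothetical batch shift
colnames(expr) <- paste0(batch, "_", 1:12)

# Euclidean distances between samples (columns), hence the transpose.
d <- dist(t(expr))

# Hierarchical clustering, cut into two groups and compared with the batches.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(table(cluster = groups, batch = batch))

# Classical MDS coordinates (two dimensions) for a scatter plot of samples.
mds <- cmdscale(d, k = 2)
print(round(mds, 2))
```

Calling `plot(hc)` and `plot(mds)` would then show the dendrogram and the sample scatter; samples that group by batch or scan date rather than by biology are the warning sign.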

If you'd like code examples let me know and I will post a link.

Also, a tip: Bioconductor has a method that lets you check the scan date on arrays. It's amazing how often people scan their arrays on completely different dates, sometimes years apart! I bet that scan date is the biggest source of batch effects in microarray experiments. I'd be very interested to read a study of this. If you find one, please let me know.




written 5.8 years ago by Ann (2.3k)

Hi ann, I misread your post. I thought you were the same person asking the question. My mistake. I changed my comment

Yes, some people use unsupervised hierarchical clustering (UHC), multidimensional scaling (MDS) and/or principal component analysis (PCA) for a semi-quantitative analysis of how much influence each of the sample attributes has on the expression profile of a sample. However, when you are interested in such a thing you are better off with analysis of variance (ANOVA). This general statistical method gives you what you want in a fully quantitative way. As the OP intuitively sensed, when a batch effect is small (but truly present and affecting your analysis) it is likely to be overlooked by UHC/MDS/PCA.

BTW, if you are concerned that a non-biological factor like batch is contributing to the variance in your expression estimates, just use it as a covariate in your differential testing formula.

modified 5.8 years ago • written 5.8 years ago by Irsan (7.1k)

Dear Ann,

Thank you for your insight.

I was thinking about the Affymetrix control probes (there are, for example, 62 of them on the HG-U133 Plus 2.0 array), not the mismatch ones. I am not sure whether a difference in mismatch probe signal between batches would be detectable.

You are right about scan dates: in one of the studies I am analysing, each array has been scanned on a different day.

written 5.8 years ago by mjarosz (0)