Question: Combining Gene Expression Microarray Datasets
8.5 years ago by
U of Alberta
Saman250 wrote:

Hi, I am trying to combine several microarray datasets downloaded from GEO, all generated on the same platform (GPL96) and normalized with the same algorithm (RMA). I thought these similarities would make them statistically comparable, but it seems I was wrong.

A simple hierarchical clustering based on Euclidean distance shows that the samples from each dataset cluster together!

I have read about algorithms like DWD (Distance Weighted Discrimination) for combining datasets, but I still have a hard time using it, mainly because it doesn't have an R implementation.

Any suggestions here?

Thanks in advance


gene data meta microarray • 9.5k views
modified 5.7 years ago by avi4you20 • written 8.5 years ago by Saman250

I think you should not use Euclidean distance in this case. Pearson-based distances would be a better choice.

written 8.5 years ago by Puthier250

I am not quite sure what you mean. Not using Euclidean distance for what?

written 8.5 years ago by Saman250

For the clustering. You are using Euclidean distance for the clustering, but there are other possible ways to measure the distance between two profiles. See Wikipedia's "Euclidean distance" article for more details.
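To illustrate the difference between the two distance measures, here is a minimal Python sketch using made-up numbers (not values from any of the datasets in this thread). The function names and the 1.5 offset are assumptions chosen for illustration only. It compares one expression profile with a copy shifted by a constant, which is a crude stand-in for a per-sample offset.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two expression profiles.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    # 1 - Pearson correlation; 0 means the profiles have the same shape.
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Toy log2 expression profile and the same profile with a constant offset.
profile = [7.1, 8.4, 5.2, 9.9, 6.3]
shifted = [v + 1.5 for v in profile]

d_euclid = euclidean(profile, shifted)          # grows with the offset
d_pearson = pearson_distance(profile, shifted)  # stays close to zero
print(d_euclid, d_pearson)
```

Note that a correlation-based distance only ignores an offset or scale change that applies to the whole sample. Batch effects that differ from gene to gene would still show up, so switching the distance measure alone may not remove the clustering by dataset.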

written 8.5 years ago by David Quigley11k

Thanks to both of you, I had already forgotten about my post! So you mean that if I use Pearson correlation as the distance, I wouldn't see that effect? I can check that. I will let you know whether this makes a difference or not.

written 8.5 years ago by Saman250

Can I take some CEL files for disease1 from experiment1 and some CEL files for the same disease1 from experiment2, and then normalize the raw data together? Is that a good idea?

written 4.6 years ago by mohitjha230

Hi Saman

I am keen to combine multiple GEO datasets (all run on Affymetrix U133 Plus 2.0) and came across your thread. Which approach did you end up using to combine your datasets? I would appreciate any help.


written 2.6 years ago by bcbio_uk0
8.5 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

If a lab generates 100 aliquots of RNA from 100 subjects and runs the same aliquots four months apart at the same core facility, I would be unsurprised to see them cluster separately. Batch effects are introduced even with that level of replication; if you take two different experiments, run by two different labs, and do not renormalize the data, it would be very surprising if you didn't see the datasets cluster separately.

To start with a more basic point:

You haven't said anything about the experiments you're using as raw data. Are the experiments purportedly measuring the same thing? (e.g. lung adenocarcinomas from early stage tumors, mouse skin treated with UV radiation, whatever) This is the biggest issue. There may be very good biological reasons why the experiments cluster separately, even aside from technical batch effects. Combining other people's data without studying the individual data sets and knowing something about the biological context can be very misleading. I'm not assuming that is what you are doing, but you haven't said anything about this.

As a practical suggestion, I would renormalize the combined datasets together from the CEL files and use a tool such as ComBat to adjust for the known between-experiment batch effects. If you don't have the CEL files, I suggest that you at least use ComBat.

written 8.5 years ago by David Quigley11k

I'm not sure that normalising these together is a great idea, though. I think you should normalise them separately and then use an appropriate meta-analysis method to analyse them. At this level, you probably don't even need to use ComBat: you can treat them as combined but separate experiments, rather than attempting to push them all through one giant normalisation/batch effect removal step.
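As a rough illustration of what "analyse separately, then combine" can look like, here is a minimal Python sketch of one generic meta-analysis technique, an inverse-variance weighted (fixed-effect) combination of per-study effect sizes for a single gene. This is an assumption about which method might be used, not the API of any particular package, and the effect sizes and variances are made-up numbers.

```python
import math

def fixed_effect_meta(effects, variances):
    # Weight each study by the inverse of its sampling variance,
    # so more precise studies contribute more to the combined estimate.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * e for w, e in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return combined, standard_error

# Hypothetical log2 fold changes for one gene in three separately
# normalised studies, with their sampling variances.
effects = [1.2, 0.8, 1.0]
variances = [0.04, 0.09, 0.05]

combined, se = fixed_effect_meta(effects, variances)
z_score = combined / se  # compare against a standard normal for a p-value
print(combined, se, z_score)
```

Because each study only contributes a within-study contrast, constant offsets between studies cancel out before the results are combined, which is why no cross-study normalisation step is needed in this setup.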

written 8.5 years ago by Daniel Swan13k

Thanks for your fast response. All microarray samples come from breast cancer patients with more or less the same conditions. My main purpose is to learn a better model using a wider range of training samples. I actually downloaded the raw files for each dataset and normalized them separately. I thought, and still think, that normalizing different datasets together is not a good idea, aside from the problem that normalizing 1000 samples in R needs more than 8GB of memory! Is there any reason to believe that normalizing them together is a good idea?

Thanks again

written 8.5 years ago by Saman250

Saman, there is already a discussion about CEL file normalisation with large numbers of chips here:

written 8.5 years ago by Daniel Swan13k

Thanks. I read them; the main issue in that thread is the memory limitation. My main concern is the validity of normalizing several datasets together. Are there any references on this?

written 8.5 years ago by Saman250

Another approach would be to replace "normalize together" with "median-center the datasets".
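A minimal Python sketch of what median-centering could look like, assuming it is done per gene within each dataset before the datasets are pooled (that reading, the gene names, and the values are all assumptions for illustration):

```python
from statistics import median

def median_center(dataset):
    # dataset maps gene -> list of log2 expression values, one per sample.
    # Each gene is shifted so that its median within this dataset is zero.
    centered = {}
    for gene, values in dataset.items():
        gene_median = median(values)
        centered[gene] = [v - gene_median for v in values]
    return centered

# Two toy studies where study_b runs systematically higher than study_a.
study_a = {"GENE1": [7.0, 7.5, 8.0], "GENE2": [5.0, 5.2, 5.4]}
study_b = {"GENE1": [9.0, 9.5, 10.0], "GENE2": [6.0, 6.2, 6.4]}

# Center each study separately, then pool the samples gene by gene.
combined = {}
for study in (study_a, study_b):
    for gene, values in median_center(study).items():
        combined.setdefault(gene, []).extend(values)

print(combined["GENE1"])
```

This removes any per-gene offset between datasets, but it also removes real biological differences if the datasets contain different mixes of sample types, so it is only reasonable when the datasets are comparable in composition.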

written 8.5 years ago by David Quigley11k

I have seen some studies that tried median-centering the data in every possible combination: the whole dataset first, then each gene separately, then the dataset again, and so on. The bottom line was that there was no improvement.

I have tried converting each gene/probe set to z-scores within each dataset and observed that it makes no real difference to the accuracy of prediction.

written 8.5 years ago by Saman250
8.5 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

First of all, I endorse David's answer entirely: ComBat.R is the R implementation you want to use to remove dataset bias in this case.

The paper claims that the DWD approach allows you to combine datasets, but really it adjusts for systematic bias, which is much the same thing ComBat.R does. I realise the authors argue that you can combine different array platforms using this technique, but it doesn't look like a traditional meta-analysis approach.

Combining ComBat.R with a dedicated meta-analysis package from the Bioconductor arsenal may be the way to go: GeneMeta, metaArray or RankProd might suit you.

written 8.5 years ago by Daniel Swan13k

I have seen ComBat.R, but for some reason I didn't try it. I will try it and let you know how it works. Thanks.

written 8.5 years ago by Saman250
5.7 years ago by
avi4you20 wrote:

Hello everyone, I am a master's student in genetics working on differential gene expression in avian influenza virus infection in chicken, which we have measured using microarrays. I want to analyze two different raw microarray datasets available from public databases and compare them with my own data. As a beginner, I don't know how to normalize the raw data so that the datasets can be compared, and I also don't know how to deal with batch effects. Can anyone help me with this? Thank you.

written 5.7 years ago by avi4you20
8.5 years ago by
Timtico330 wrote:

Or one could use an internal control. Calculate ratios against genes that you know should be expressed equally in all of the datasets?

written 8.5 years ago by Timtico330

I think actually picking something sensible as a housekeeping baseline is very hard indeed. Every time I look at a classic 'housekeeping' gene in a microarray dataset, I'm surprised just how variable they can be.

written 8.5 years ago by Daniel Swan13k

That can indeed be an issue, but in our arrays the actin levels are equal in many cases, so we can define %-actin as a measure of the level of gene expression when comparing different arrays.
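A minimal Python sketch of the %-actin idea, assuming linear-scale (not log-transformed) intensities and using ACTB as the reference gene. The function name, gene names, and intensity values are hypothetical and only serve to show the arithmetic.

```python
def percent_of_reference(sample, reference_gene):
    # sample maps gene -> linear-scale intensity on one array.
    # Each gene is expressed as a percentage of the reference gene's signal.
    reference = sample[reference_gene]
    if reference <= 0:
        raise ValueError("reference gene intensity must be positive")
    return {gene: 100.0 * value / reference for gene, value in sample.items()}

# Two toy arrays where the second one is twice as bright overall.
array_1 = {"ACTB": 2000.0, "GENE1": 500.0}
array_2 = {"ACTB": 4000.0, "GENE1": 1000.0}

scaled_1 = percent_of_reference(array_1, "ACTB")
scaled_2 = percent_of_reference(array_2, "ACTB")
print(scaled_1["GENE1"], scaled_2["GENE1"])
```

On log2-scale data such as RMA output, the equivalent operation would be subtracting the reference gene's value rather than dividing by it. In either case the result is only as stable as the reference gene itself.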

written 8.5 years ago by Timtico330