8.5 years ago by
If a lab generates 100 aliquots of RNA from 100 subjects and runs the same aliquots four months apart at the same core facility, I would be unsurprised to see the two runs cluster separately. Batch effects creep in even with that level of replication. If you take two different experiments, run by two different labs, and do not renormalize the data, it would be very surprising if they did not cluster by experiment.
To start with a more basic point:
You haven't said anything about the experiments you're using as raw data. Are the experiments purportedly measuring the same thing? (e.g. lung adenocarcinomas from early stage tumors, mouse skin treated with UV radiation, whatever) This is the biggest issue. There may be very good biological reasons why the experiments cluster separately, even aside from technical batch effects. Combining other people's data without studying the individual data sets and knowing something about the biological context can be very misleading. I'm not assuming that is what you are doing, but you haven't said anything about this.
As for practical suggestions: renormalize the combined data sets together, starting from the CEL files, and then use a tool such as ComBat to adjust for the known between-experiment batch effects. If you don't have the CEL files, at least run ComBat on the processed data you do have.
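To make the idea of a batch adjustment concrete, here is a minimal Python sketch. It is not ComBat: ComBat adjusts both location and scale and uses empirical Bayes shrinkage, whereas this only removes each batch's per-gene mean shift. The function name `center_batches`, the labels `lab_A` and `lab_B`, and the toy numbers are all made up for illustration, and the values are assumed to be already normalized and on a log scale.

```python
from collections import defaultdict
from statistics import mean


def center_batches(expr, batches):
    """Remove per-gene batch mean shifts from a genes x samples matrix.

    expr: list of rows, one row per gene, one column per sample.
    batches: list of batch labels, one per sample column.
    Simplified illustration only; this is NOT the ComBat algorithm.
    """
    # Group sample column indices by batch label.
    columns = defaultdict(list)
    for j, label in enumerate(batches):
        columns[label].append(j)

    adjusted = []
    for row in expr:
        grand_mean = mean(row)
        new_row = list(row)
        for cols in columns.values():
            # Shift this batch so its mean matches the gene's overall mean.
            shift = mean(row[j] for j in cols) - grand_mean
            for j in cols:
                new_row[j] = row[j] - shift
        adjusted.append(new_row)
    return adjusted


# Hypothetical toy data: 2 genes, 4 samples, two labs.
expr = [
    [5.0, 5.2, 7.0, 7.2],  # gene with a large between-lab shift
    [3.0, 3.4, 3.1, 3.3],  # gene with almost no shift
]
batches = ["lab_A", "lab_A", "lab_B", "lab_B"]
adjusted = center_batches(expr, batches)
```

After the adjustment, each gene has the same mean in both batches, while differences between samples within a batch are untouched. That is also the danger: if the two experiments differ biologically, the adjustment removes that difference along with the technical one, which is why the biological context has to be checked first.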