Question: Specific microarray tumor clustering workflow
6.4 years ago, mjarosz wrote:

Hi All,

I am clustering microarray data to find tumor subtypes. My data come from multiple GEO studies, all based on the Affymetrix U133 Plus 2.0 array. All samples have been log2-transformed and RMA-normalized (on a per-study basis). For this analysis, I have come up with the following workflow:
(1) Combine all arrays (tumors) into one file.
(2) Define batch effects.
(3) Remove batch effects using (a) pamr and (b) sva.
Q: Is it OK to apply these batch correction procedures to log2 data, or should I delogarithmize the data beforehand?
(4) Delogarithmize the data.
Q: Would it be better not to delogarithmize the data before standardization?
(5) Standardize the data using R (standardize rows, that is, genes).
(6) Cluster all tumors using ConsensusCluster (use k-means with Euclidean distance and SOM).
(7) Select genes whose expression profiles differ between the classes found by the clustering (genes that pass a t-test p-value threshold of 0.000001).
Q: Is it OK to use log2-transformed, RMA-normalized and batch-corrected data for the t-test (that is, without standardizing)?
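
For illustration, the row standardization in step (5) could be sketched as follows. This is a minimal plain-Python sketch, not the R code actually used; the function name `standardize_rows` and the example values are hypothetical, and the input is assumed to be a genes × samples matrix.

```python
from statistics import mean, stdev

def standardize_rows(matrix):
    """Z-score each row (gene) across samples: subtract the row mean, divide by the row SD."""
    out = []
    for row in matrix:
        m = mean(row)
        s = stdev(row)  # sample standard deviation (n - 1 denominator)
        if s == 0:
            # A gene with no variation across samples cannot be scaled; map it to zeros.
            out.append([0.0] * len(row))
        else:
            out.append([(x - m) / s for x in row])
    return out

# Hypothetical expression values: 2 genes (rows) x 4 tumors (columns)
expr = [
    [7.0, 8.0, 9.0, 10.0],
    [5.0, 5.0, 5.0, 5.0],
]
z = standardize_rows(expr)
```

After this step every non-constant gene has mean 0 and standard deviation 1 across tumors, so each gene contributes on the same scale to a Euclidean distance.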

What flaws do you see in this workflow?

Best regards,



clustering microarray tumor • 2.2k views

I'd recommend co-normalizing all arrays together rather than using the per-study normalization as is. At step (6), you'll need to decide which genes to use in the clustering. Using all genes on the HGU133Plus2 is generally not a good idea, as many probesets will contribute mainly noise. Instead, use a subset of genes that reflects biological distinctions, defined by (for example) a variance filter or an Absent/Present call filter. Euclidean distance on standardized log-scale data would be OK.

It is OK to apply batch corrections to log2-scale data. It would also be OK to run within-gene t-tests or the like on log2-transformed, RMA-normalized and batch-corrected data. If you have more than two putative tumor classes, I assume you would run (for example) an ANOVA rather than a t-test at step (7) to identify genes that differ among the putative subtypes. If you are attempting to build a predictor of putative tumor subtypes, set aside a portion of your samples as a test set to evaluate your classifier.
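
As a rough illustration of two of these suggestions, here is a minimal plain-Python sketch of a variance filter (for choosing clustering genes) and a per-gene one-way ANOVA F statistic (for step (7) with more than two classes). The helper names, gene names, values and class labels are all hypothetical; in practice you would use established R/Bioconductor routines, and the sketch returns F statistics only, not p-values.

```python
from statistics import mean, variance

def top_variance_genes(expr, n_keep):
    """Return the names of the n_keep genes with the largest variance across samples."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n_keep]

def split_by_class(values, labels):
    """Group one gene's values by the class label of each sample."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return list(groups.values())

def anova_f(groups):
    """One-way ANOVA F statistic for one gene; groups is a list of per-class value lists."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Raises ZeroDivisionError if a gene has no within-class variation.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical log2 expression values for 3 genes across 9 tumors
expr = {
    "gene_a": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "gene_b": [6.0, 6.1, 6.0, 6.1, 6.0, 6.1, 6.0, 6.1, 6.0],
    "gene_c": [4.0, 9.0, 5.0, 8.0, 4.0, 9.0, 5.0, 8.0, 6.0],
}
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # hypothetical cluster assignments

kept = top_variance_genes(expr, 2)  # drops the nearly flat gene_b
f_stats = {g: anova_f(split_by_class(expr[g], labels)) for g in kept}
```

Here `gene_a` gets a large F statistic because its class means differ, while `gene_c` is variable overall but not between classes, so it passes the variance filter yet scores low at step (7).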



written 6.4 years ago by Ahill

Thank you for the valuable insight, Ahill. I have one more question: would it also be OK to run these batch correction procedures (pamr, sva) on delogarithmized data?

written 6.4 years ago by mjarosz