I am clustering microarray data to find tumor subtypes. The data come from multiple GEO studies, all run on the Affymetrix U133 Plus 2.0 array. All samples have been log2-transformed and RMA normalized, separately for each study. For this analysis, I have come up with the following workflow:
(1) Combine all arrays (tumors) into one file.
(2) Define batch effects.
(3) Remove batch effects using (a) pamr and (b) sva.
Q: Is it OK to apply these batch correction procedures to log2 data, or should I delogarithmize the data (transform back to the linear scale) beforehand?
(4) Delogarithmize the data.
Q: Would it be better to leave the data on the log2 scale for the standardization in step (5)?
(5) Standardize the data using R (standardize rows, that is, genes).
(6) Cluster all tumors using ConsensusCluster (use k-means with Euclidean distance and SOM).
(7) Select genes whose expression profiles differ between the classes found by the clustering (genes with a t-test p-value below 0.000001).
Q: Is it OK to run the t-test on the log2-transformed, RMA-normalized and batch-corrected data, that is, without standardization?
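To make the question about steps (4) and (5) concrete, here is a minimal Python sketch of what I mean by standardizing rows, applied once to log2 values and once to the same values after delogarithmizing. The expression values are made up and the helper name `standardize_rows` is my own; it only illustrates the operation and is not the R code I actually run.

```python
from statistics import mean, stdev

def standardize_rows(matrix):
    """Z-score each row (gene) across its columns (samples)."""
    result = []
    for row in matrix:
        mu = mean(row)
        sd = stdev(row)  # sample standard deviation (n - 1 denominator)
        # A constant row has sd == 0; map it to zeros instead of dividing by zero.
        result.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return result

# Hypothetical log2 expression values: 2 genes (rows) x 4 samples (columns).
log2_expr = [
    [7.0, 7.5, 8.0, 12.0],
    [5.0, 5.2, 5.1, 5.3],
]

# Standardize directly on the log2 scale.
z_log = standardize_rows(log2_expr)

# Delogarithmize first (2 ** x), then standardize on the linear scale.
linear_expr = [[2.0 ** x for x in row] for row in log2_expr]
z_linear = standardize_rows(linear_expr)

for gene, (a, b) in enumerate(zip(z_log, z_linear)):
    print(gene, [round(v, 3) for v in a], [round(v, 3) for v in b])
```

Both versions give each gene mean 0 and standard deviation 1, but the z-scores are not the same, so the Euclidean distances used in step (6) depend on which scale is standardized.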
What flaws do you see in this workflow?
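For step (7), this is roughly the per-gene comparison I have in mind, sketched in Python with made-up values. I am assuming a Welch (unequal variance) t-test between two cluster labels; the gene names, the labels and the helper `welch_t` are hypothetical. The sketch stops at the t statistic and degrees of freedom, because the p-value needs a t-distribution function that is not in the Python standard library (for example `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical cluster assignment for 6 tumors (output of step 6).
labels = [1, 1, 1, 2, 2, 2]

# Hypothetical log2, batch-corrected expression values per gene.
expr = {
    "geneA": [8.1, 8.3, 7.9, 5.0, 5.2, 4.8],
    "geneB": [6.0, 6.2, 5.8, 6.1, 5.9, 6.0],
}

results = {}
for gene, values in expr.items():
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group2 = [v for v, lab in zip(values, labels) if lab == 2]
    results[gene] = welch_t(group1, group2)

for gene, (t, df) in results.items():
    print(gene, round(t, 2), round(df, 2))
```

With more than two clusters I would repeat this for each pair of classes, or for each class against the rest.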