I have 6 data sets which I am using to build and test a machine learning model for cancer classification. The problem is that one of these data sets is RNA-seq, and I am not sure of the best way to combine these different types of data. This is what I have done:
- The microarray data are already normalized. Then, I took the z-score of each data set separately.
- For the RNA-seq data, I have the raw counts. I filtered the expression matrix, keeping only genes with > 1 cpm in at least 50% of the samples, then normalized it using the following code:

    library(edgeR)
    # Wrap the filtered count matrix, compute TMM factors, then get log2-CPM
    RNA_expr <- DGEList(RNA_expr)
    RNA_expr <- calcNormFactors(RNA_expr, method = "TMM")
    RNA_expr <- cpm(RNA_expr, log = TRUE, prior.count = 3, normalized.lib.sizes = TRUE)

- Then, I took the z-score of this RNA-seq data as well.
- Then, I combined all the data together and divided them into train and test data.
My question: is this a valid approach or not? Or should I combine all the data first, then take the z-score, and finally divide into train and test data?
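For concreteness, the first option (z-score each data set separately, then combine) might look roughly like the sketch below. It is only an illustration under assumptions: the matrices are made-up toy data with genes in rows and samples in columns, the z-score is assumed to be taken per gene across samples (the post does not say whether it is per gene or per sample), and `array_expr`, `rnaseq_logcpm` and `zscore_genes` are hypothetical names.

```r
set.seed(1)

# Toy stand-ins for two already-normalized data sets (genes x samples).
# The gene sets only partly overlap, as is common across platforms.
array_expr <- matrix(rnorm(6 * 5, mean = 8), nrow = 6,
                     dimnames = list(paste0("g", 1:6), paste0("arr", 1:5)))
rnaseq_logcpm <- matrix(rnorm(6 * 4, mean = 3), nrow = 6,
                        dimnames = list(paste0("g", 3:8), paste0("seq", 1:4)))

# Assumption: z-score each gene across the samples of ONE data set.
# scale() works on columns, so transpose to samples x genes and back.
zscore_genes <- function(x) t(scale(t(x)))

# Keep only genes measured in both data sets, z-score each data set
# separately, then bind the samples side by side.
shared <- intersect(rownames(array_expr), rownames(rnaseq_logcpm))
combined <- cbind(zscore_genes(array_expr[shared, ]),
                  zscore_genes(rnaseq_logcpm[shared, ]))

dim(combined)  # shared genes x all samples
```

The second option would instead call `zscore_genes()` once on the bound matrix, so the platform-level offset between the data sets would remain in the values.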
Plot a PCA of all the data and color by RNA-seq/microarray. I bet the samples will group by technology. It's ML, so you can do whatever you want; just make sure the cancer samples are evenly distributed between the technologies, and even between library prep and sequencing methods, otherwise you'll be predicting sequencing technology rather than cancer.
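A rough sketch of both checks, using base R only. Everything here is toy data and hypothetical names (`expr`, `platform`, `label`); the artificial shift added to the RNA-seq samples is there just to mimic a platform effect.

```r
set.seed(42)

# Toy combined matrix: 50 genes x 40 samples, first 20 microarray, last 20 RNA-seq.
expr <- matrix(rnorm(50 * 40), nrow = 50,
               dimnames = list(paste0("g", 1:50), paste0("s", 1:40)))
platform <- factor(rep(c("microarray", "rnaseq"), each = 20))
label <- factor(rep(c("cancer", "normal"), times = 20))

# Artificial platform effect (assumption, for illustration only).
expr[, platform == "rnaseq"] <- expr[, platform == "rnaseq"] + 3

# PCA on samples: prcomp() expects observations in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Color the first two components by technology (only drawn in an interactive session).
if (interactive()) {
  plot(pca$x[, 1], pca$x[, 2], col = as.integer(platform), pch = 19,
       xlab = "PC1", ylab = "PC2")
  legend("topright", legend = levels(platform), col = 1:2, pch = 19)
}

# Cross-tabulate class labels against technology to check the balance.
balance <- table(label, platform)
print(balance)
```

If the points separate by color along the first component, the technology is a dominant source of variation; the table shows whether each class is represented on each platform.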
Thanks Asaf. So if I take the z-score of all the data combined before dividing into training and test data, doesn't this violate the independence between training and test data?
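One common way to sidestep that concern is to estimate the z-score parameters on the training samples only and reuse them for the test samples. The sketch below assumes a toy matrix with samples in rows and genes in columns and a random 20/10 split; `x`, `train_idx`, `mu` and `sdv` are hypothetical names, and this is not necessarily what the thread participants did.

```r
set.seed(7)

# Toy expression matrix: 30 samples x 5 genes.
x <- matrix(rnorm(30 * 5, mean = 5, sd = 2), nrow = 30,
            dimnames = list(paste0("s", 1:30), paste0("g", 1:5)))

# Split first, before any scaling.
train_idx <- sample(nrow(x), 20)
x_train_raw <- x[train_idx, ]
x_test_raw <- x[-train_idx, ]

# Estimate per-gene mean and sd from the training samples only.
mu <- colMeans(x_train_raw)
sdv <- apply(x_train_raw, 2, sd)

# Apply the SAME training parameters to both sets.
x_train <- scale(x_train_raw, center = mu, scale = sdv)
x_test <- scale(x_test_raw, center = mu, scale = sdv)

round(colMeans(x_train), 3)  # zero by construction
round(colMeans(x_test), 3)   # close to zero, but not exactly
```

Because the test samples never contribute to `mu` or `sdv`, nothing about them leaks into the preprocessing used for training.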