Integration of microarray and RNA-seq data for a machine learning model?
4.8 years ago
Mohamed Omar ▴ 10

I have 6 data sets which I am using to build and test a machine learning model for cancer classification. The problem is that one of these data sets is RNA-seq, and I am not sure of the best way to combine these different data types. This is what I have done:

  • The microarray data are already normalized, so I took the z-score of each data set separately.
  • For the RNA-seq data, I have the raw counts. I filtered the expression matrix, keeping only genes with > 1 cpm in at least 50% of the samples, then normalized it using the following code:

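    library(edgeR)
    RNA_expr <- DGEList(RNA_expr)  # wrap the filtered raw count matrix
    RNA_expr <- calcNormFactors(RNA_expr, method = "TMM")  # TMM scaling factors
    # log2-CPM computed with the TMM-normalized library sizes
    RNA_expr <- cpm(RNA_expr, log = TRUE, prior.count = 3, normalized.lib.sizes = TRUE)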

  • Then I took the z-score of the RNA-seq data as well.

  • Then I combined all the data and divided them into training and test sets.
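The z-score, combine, and split steps above can be sketched roughly as follows. This is only an illustration on simulated matrices, not the actual data: the names `microarray`, `rnaseq`, and `zscore_genes` are hypothetical, the z-score is assumed to be taken gene-wise (the post does not say whether it is per gene or per sample), and the 70/30 split is an arbitrary choice.

```r
# Toy stand-ins for the real matrices (genes in rows, samples in columns).
set.seed(1)
genes <- paste0("gene", 1:50)
microarray <- matrix(rnorm(50 * 20, mean = 8, sd = 2), nrow = 50,
                     dimnames = list(genes, paste0("MA_", 1:20)))
rnaseq <- matrix(rnorm(50 * 12, mean = 3, sd = 1), nrow = 50,
                 dimnames = list(genes, paste0("RS_", 1:12)))

# Gene-wise z-score within one data set (assumption: per gene, not per sample).
# scale() standardizes columns, so transpose, scale, and transpose back.
zscore_genes <- function(x) t(scale(t(x)))

microarray_z <- zscore_genes(microarray)
rnaseq_z <- zscore_genes(rnaseq)

# Combine the data sets on the genes they share.
shared <- intersect(rownames(microarray_z), rownames(rnaseq_z))
combined <- cbind(microarray_z[shared, ], rnaseq_z[shared, ])

# Random split of the samples into training and test sets (70/30 is arbitrary).
train_idx <- sample(ncol(combined), size = round(0.7 * ncol(combined)))
train <- combined[, train_idx]
test <- combined[, -train_idx]
```

With real data, the `intersect()` step matters because the platforms rarely cover exactly the same genes.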

My question: is this a valid approach? Or should I combine all the data first, then take the z-score, and finally divide into training and test data?

RNA-Seq Microarrays Machine Learning • 1.3k views

Plot a PCA of all the data and color by RNA-seq/microarray. I bet the samples from each technology will be grouped together. It's ML, so you can do whatever you want; just make sure the cancer samples are evenly distributed between the technologies, and even across library prep and sequencing methods, otherwise you'll be predicting sequencing technology rather than cancer.
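A minimal base R sketch of that check, on simulated data rather than the real matrices: the objects `expr`, `technology`, and `label` are hypothetical placeholders for the combined expression matrix and the sample annotations. It plots the first two principal components colored by technology, then cross-tabulates class labels against technology to see whether the classes are spread evenly.

```r
# Simulated gene-by-sample matrices with a platform-level shift (assumption).
set.seed(2)
n_genes <- 100
microarray <- matrix(rnorm(n_genes * 20, mean = 8), nrow = n_genes)
rnaseq <- matrix(rnorm(n_genes * 12, mean = 3), nrow = n_genes)
expr <- cbind(microarray, rnaseq)

# Hypothetical sample annotations: platform and class label.
technology <- factor(rep(c("microarray", "RNA-seq"), times = c(20, 12)))
label <- factor(sample(c("cancer", "normal"), ncol(expr), replace = TRUE),
                levels = c("cancer", "normal"))

# prcomp() expects samples in rows, so transpose the gene-by-sample matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Color the first two components by technology.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(technology), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(technology),
       col = seq_along(levels(technology)), pch = 19)

# Cross-tabulate class labels against technology to check the balance.
balance <- table(label, technology)
print(balance)
```

If the points separate by color along PC1, the dominant source of variation is the platform rather than the biology, which is the situation the comment warns about.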


Thanks Asaf. So if I took the z-score of all the data combined before dividing into training and test data, wouldn't this violate the independence between training and test data?
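One common way to sidestep this concern is to estimate the z-score parameters from the training samples only and then apply them to the test samples, so that no test sample contributes to the scaling. The sketch below shows this on simulated data; the names `expr`, `gene_mean`, and `gene_sd` are hypothetical, and the z-score is assumed to be gene-wise.

```r
# Simulated gene-by-sample matrix standing in for the combined data.
set.seed(3)
expr <- matrix(rnorm(40 * 30, mean = 5, sd = 2), nrow = 40,
               dimnames = list(paste0("gene", 1:40), paste0("s", 1:30)))

# Split the samples first (21 training, 9 test; the sizes are arbitrary).
train_idx <- sample(ncol(expr), size = 21)
train <- expr[, train_idx]
test <- expr[, -train_idx]

# Per-gene mean and standard deviation from the training samples only.
gene_mean <- rowMeans(train)
gene_sd <- apply(train, 1, sd)

# Apply the training parameters to both sets. A length-nrow vector is
# recycled down each column, so every gene (row) uses its own mean and sd.
train_z <- (train - gene_mean) / gene_sd
test_z <- (test - gene_mean) / gene_sd
```

The test set's gene means will generally not be exactly zero after this transformation, which is expected: the test samples were scaled with parameters they did not influence.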
