Hi there,
I've identified around 150 differentially expressed genes (DEGs) between lung adenocarcinoma and squamous cell carcinoma groups in a microarray dataset, and I want to use these genes to build a classifier that predicts lung cancer type on a new dataset.
The first dataset I worked with is from Illumina and the second is Affymetrix U133 2.0. I background-corrected and log2-transformed the raw data, then applied the YuGene transform (a way to standardize expression between 0 and 1) to both datasets in their entirety. I then filtered both datasets down to the expression of the 150 DEGs, hoping to train a random forest model on the Illumina data and test it on the Affy data.
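The train-on-one-platform, test-on-the-other workflow described above can be sketched roughly as follows. This is only a minimal illustration in Python with scikit-learn, not the actual analysis code: the expression matrices are synthetic, and the names `fake_platform`, `deg_list`, `train_df`, and `test_df` are hypothetical placeholders for the already normalized Illumina and Affy matrices (samples in rows, genes in columns) and the DEG list.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def fake_platform(n_samples, genes, n_informative=150):
    """Synthetic stand-in for one normalized expression matrix (samples x genes)."""
    y = rng.integers(0, 2, size=n_samples)       # made-up labels: 0 = adeno, 1 = squamous
    X = rng.uniform(0.0, 1.0, size=(n_samples, len(genes)))
    X[:, :n_informative] += 0.3 * y[:, None]     # first genes carry a class signal
    return pd.DataFrame(X, columns=genes), y

genes = [f"gene_{i}" for i in range(500)]
deg_list = genes[:150]                           # placeholder for the ~150 DEGs

train_df, y_train = fake_platform(120, genes)    # stands in for the Illumina data
test_df, y_test = fake_platform(80, genes)       # stands in for the Affy data

# Keep only DEGs present on both platforms, in the same column order.
shared = [g for g in deg_list if g in train_df.columns and g in test_df.columns]
X_train, X_test = train_df[shared], test_df[shared]

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print(f"genes used: {len(shared)}, test accuracy: {acc:.2f}")
```

In this toy setup both matrices are drawn from the same distribution, so it says nothing about cross-platform effects; it only shows the shape of the pipeline (shared gene subset, fit on one dataset, predict on the other).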
However, the prediction accuracy on the Affy data is only ~65%. This confuses me because when I run a PCA using the 150 genes in this data, I see a clear visual separation between the two groups, and the weights from PC1 give an AUC of 0.88. This makes me think the problem with my classifier lies in normalization/standardization rather than in having selected poor features for my model. Please let me know whether this line of thinking is correct and what I can do to improve the prediction accuracy of my model.
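For reference, the PCA check can be sketched like this. It is a minimal example on synthetic data, and it assumes that the "AUC of PC1" means scoring each sample by its PC1 coordinate and comparing that score against the known class labels; the variable names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic stand-in for the test data restricted to the 150 genes.
n_samples, n_genes = 80, 150
y = rng.integers(0, 2, size=n_samples)           # made-up class labels
X = rng.normal(size=(n_samples, n_genes))
X[:, :50] += 1.0 * y[:, None]                    # some genes carry a class signal

# Per-sample PC1 scores (PCA centers the data internally).
pc1 = PCA(n_components=2).fit_transform(X)[:, 0]

# The sign of a principal component is arbitrary, so report the
# AUC in whichever direction separates the classes.
auc = roc_auc_score(y, pc1)
auc = max(auc, 1.0 - auc)
print(f"AUC of PC1 scores: {auc:.2f}")
```

Note that this check is computed within a single dataset, so it does not depend on the two platforms being on a common scale.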
I really appreciate your time reading through this and look forward to any responses.