Machine learning using microarray data
4.5 years ago
Gene_MMP8 ▴ 240

I am developing machine learning models to classify disease vs. non-disease patients from gene expression data. I applied LASSO for feature selection and built classifiers using some of the top-ranked features. Now I need to perform external validation on an independent test set to judge my model's generalizability, and this is where I am running into trouble.
The training set comes from a GEO dataset generated on the [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array platform. However, the ArrayExpress test set I want to use, which has the same disease/non-disease labels as my training data, was generated on the A-MEXP-2210 Illumina HumanHT-12_V4_0_R1_15002873_B platform. As a result, some of the top features I selected on the training set are missing altogether from the test set. What is the ideal way to validate here?

  1. Use only those genes that are there in test set, select those genes from training set and build a model?
  2. KNN-impute those genes in the test set and do the analysis?
  3. Assign expression value zero for those genes in the test set and do the analysis?
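For context, the LASSO-based selection step described above might look roughly like this (a minimal sketch on synthetic data; the gene names and the L1-penalized logistic regression are stand-ins for whatever data and LASSO implementation were actually used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic expression matrix: 100 patients x 500 genes
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)                  # disease / non-disease labels
genes = np.array([f"GENE_{i}" for i in range(500)])

# L1 penalty drives most coefficients to exactly zero
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

# Genes with non-zero coefficients are the selected features
selected = genes[lasso.coef_[0] != 0]
print(f"{len(selected)} of {len(genes)} genes selected")
```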
RNA-Seq R • 1.4k views
4.5 years ago
dsull ★ 5.8k

I don't recommend imputing (zero, knn, or otherwise). If those features are missing altogether in the test set, you can't really infer what values those features (if they were to exist) would have unless you have some prior knowledge (which should not be taken from your training set) about what values those features should have.

The best way is to intersect the features from the training set and testing set, and work off of that for building your model.
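Concretely, that intersection step might be sketched like this (hypothetical gene symbols and variable names; in practice you would first map both platforms' probe IDs to a common gene identifier before intersecting):

```python
import numpy as np
import pandas as pd

# Hypothetical expression matrices indexed by gene symbol (rows = genes)
train = pd.DataFrame(np.random.rand(4, 3),
                     index=["TP53", "MMP8", "BRCA1", "EGFR"],
                     columns=["s1", "s2", "s3"])
test = pd.DataFrame(np.random.rand(3, 2),
                    index=["TP53", "EGFR", "GAPDH"],
                    columns=["t1", "t2"])

# Keep only genes measured on BOTH platforms, in the same row order
shared = train.index.intersection(test.index)
train_shared = train.loc[shared]
test_shared = test.loc[shared]
print(f"Shared genes: {sorted(shared)}")
```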

If you're afraid of your model losing important features as a result, but still want to validate the model on an independent dataset, then I would go with imputing. Since entire features are missing, you wouldn't be able to use KNN (and you certainly shouldn't use any information at all from the training set for imputation purposes) -- I'd probably just assign the missing features to be the average expression of all genes for that sample. Just don't expect your model to perform as well, since you're giving semi-arbitrary values to some features.
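A sketch of that fallback, per-sample mean fill (all gene and sample names are hypothetical; note the means are computed from the test set only, never the training set):

```python
import pandas as pd

# Test-set expression matrix: rows = genes, columns = samples
test = pd.DataFrame({"s1": [2.0, 4.0], "s2": [1.0, 3.0]},
                    index=["TP53", "EGFR"])

# Features the model expects but the Illumina array does not measure
missing = ["MMP8", "BRCA1"]

# Fill each missing gene with that sample's mean expression
# across the genes that WERE measured
sample_means = test.mean(axis=0)
for gene in missing:
    test.loc[gene] = sample_means

print(test)
```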

Another way to do validation (if you're only interested in feature selection, i.e. if you just want to see whether you've selected "good" features that can predict disease vs. non-disease) is to take the top features you found, use only the ones that also exist in the second dataset, and then train + cross-validate your model using just the second dataset. This will not tell you whether the model you trained using the first dataset is generalizable to other datasets, but it will tell you whether your features (which you selected using machine learning on your first dataset) can be predictive of disease/non-disease in another dataset.
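That feature-level check might be sketched as follows (synthetic data; logistic regression stands in for whichever classifier was actually used, and the feature names are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Second (Illumina) dataset: 80 samples x the genes on that platform
genes_test = [f"GENE_{i}" for i in range(300)]
X2 = rng.normal(size=(80, 300))
y2 = rng.integers(0, 2, size=80)

# Top features chosen on the FIRST dataset; keep only those present here
top_features = ["GENE_5", "GENE_12", "GENE_999"]  # GENE_999 not on this array
usable = [g for g in top_features if g in genes_test]
cols = [genes_test.index(g) for g in usable]

# Train + cross-validate a fresh model using ONLY the second dataset
scores = cross_val_score(LogisticRegression(), X2[:, cols], y2, cv=5)
print(f"Using {len(usable)} of {len(top_features)} features; "
      f"mean CV accuracy = {scores.mean():.2f}")
```

A high cross-validated score here supports the selected features being informative in the new cohort, even though it says nothing about the original fitted model's transferability.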
