Machine learning using microarray data
4.5 years ago
Gene_MMP8 ▴ 240

I am developing machine learning models to classify disease vs. non-disease patients from gene expression data. I applied LASSO for feature selection and built classifiers using some of the top-ranked features. Now I need to perform external validation on an independent test set to judge my model's generalizability, and this is where I am running into trouble.
The training set comes from a GEO dataset generated on the [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array platform. However, the ArrayExpress test set I want to use, which has the same disease/non-disease labels as my training data, was generated on the A-MEXP-2210 Illumina HumanHT-12_V4_0_R1_15002873_B platform. As a result, some of the top features I selected on the training set are missing altogether from the test set. What is the ideal way to validate here?

  1. Use only those genes that are there in test set, select those genes from training set and build a model?
  2. KNN-impute those genes in the test set and do the analysis?
  3. Assign expression value zero for those genes in the test set and do the analysis?
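For context, the LASSO-based selection step described above might look roughly like this (a minimal sketch on synthetic data; the gene names and the L1-penalized logistic regression are stand-ins for whatever data and LASSO implementation were actually used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic expression matrix: 100 patients x 500 genes
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)                  # disease / non-disease labels
genes = np.array([f"GENE_{i}" for i in range(500)])

# L1 penalty drives most coefficients to exactly zero
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

# Genes with non-zero coefficients are the selected features
selected = genes[lasso.coef_[0] != 0]
print(f"{len(selected)} of {len(genes)} genes selected")
```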
RNA-Seq R • 1.4k views
4.5 years ago
dsull ★ 5.8k

I don't recommend imputing (zero, knn, or otherwise). If those features are missing altogether in the test set, you can't really infer what values those features (if they were to exist) would have unless you have some prior knowledge (which should not be taken from your training set) about what values those features should have.

The best way is to intersect the features from the training set and testing set, and work off of that for building your model.
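Concretely, that intersection step might be sketched like this (hypothetical gene symbols and variable names; in practice you would first map both platforms' probe IDs to a common gene identifier before intersecting):

```python
import numpy as np
import pandas as pd

# Hypothetical expression matrices indexed by gene symbol (rows = genes)
train = pd.DataFrame(np.random.rand(4, 3),
                     index=["TP53", "MMP8", "BRCA1", "EGFR"],
                     columns=["s1", "s2", "s3"])
test = pd.DataFrame(np.random.rand(3, 2),
                    index=["TP53", "EGFR", "GAPDH"],
                    columns=["t1", "t2"])

# Keep only genes measured on BOTH platforms, in the same row order
shared = train.index.intersection(test.index)
train_shared = train.loc[shared]
test_shared = test.loc[shared]
print(f"Shared genes: {sorted(shared)}")
```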

If you're afraid of your model losing important features as a result, but still want to validate the model on an independent dataset, then I would go with imputing. Since entire features are missing, you wouldn't be able to use KNN (and you certainly shouldn't use any information at all from the training set for imputation purposes) -- I'd probably just assign the missing features to be the average expression of all genes for that sample. Just don't expect your model to perform as well, since you're giving semi-arbitrary values to some features.
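A sketch of that fallback, per-sample mean fill (all gene and sample names are hypothetical; note the means are computed from the test set only, never the training set):

```python
import pandas as pd

# Test-set expression matrix: rows = genes, columns = samples
test = pd.DataFrame({"s1": [2.0, 4.0], "s2": [1.0, 3.0]},
                    index=["TP53", "EGFR"])

# Features the model expects but the Illumina array does not measure
missing = ["MMP8", "BRCA1"]

# Fill each missing gene with that sample's mean expression
# across the genes that WERE measured
sample_means = test.mean(axis=0)
for gene in missing:
    test.loc[gene] = sample_means

print(test)
```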

Another way to do validation (if you're only interested in feature selection, i.e. if you just want to see whether you've selected "good" features that can predict disease vs. non-disease) is to take the top features you found, use only the ones that also exist in the second dataset, and then train + cross-validate your model using just the second dataset. This will not tell you whether the model you trained using the first dataset is generalizable to other datasets, but it will tell you whether your features (which you selected using machine learning on your first dataset) can be predictive of disease/non-disease in another dataset.
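That feature-level check might be sketched as follows (synthetic data; logistic regression stands in for whichever classifier was actually used, and the feature names are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Second (Illumina) dataset: 80 samples x the genes on that platform
genes_test = [f"GENE_{i}" for i in range(300)]
X2 = rng.normal(size=(80, 300))
y2 = rng.integers(0, 2, size=80)

# Top features chosen on the FIRST dataset; keep only those present here
top_features = ["GENE_5", "GENE_12", "GENE_999"]  # GENE_999 not on this array
usable = [g for g in top_features if g in genes_test]
cols = [genes_test.index(g) for g in usable]

# Train + cross-validate a fresh model using ONLY the second dataset
scores = cross_val_score(LogisticRegression(), X2[:, cols], y2, cv=5)
print(f"Using {len(usable)} of {len(top_features)} features; "
      f"mean CV accuracy = {scores.mean():.2f}")
```

A high cross-validated score here supports the selected features being informative in the new cohort, even though it says nothing about the original fitted model's transferability.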
