Machine learning using microarray data
4.5 years ago
Gene_MMP8 ▴ 240

I have developed a classification model using microarray data from a GEO dataset. Let's call this the train set. The class labels were "extreme" vs "not-so-extreme" disease course. Now my advisor has asked me to test the generalizability of the model. But there is no other dataset with the same set of labels as described above. However, it is known that the "extreme" disease course often leads to death, while the "not-so-extreme" course usually does not. So I looked for datasets with mortality labels [Survivor and Deceased] and found one. Let's call this the test set. The "extreme" label of the train set has therefore been matched to "Deceased" in the test set, and "not-so-extreme" to "Survivor".

Here's where the problem starts. I selected the best set of features and did parameter tuning on the train set alone, and now, when I validate on the test set, I get an AUC of around 0.50. I don't know whether this is because of the way I have defined the labels or because of the different microarray platforms on which the data were collected. The training dataset is based on the "[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array" platform and the test data on the "Illumina HumanHT-12_V4_0_R1_15002873_B" platform. Can somebody help?
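For reference, this is roughly how I map the mortality labels onto the training labels and score the test set (a minimal sketch; the labels and scores below are toy stand-ins for my actual data and model):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy stand-ins: y_mortality would be the test-set mortality labels and
# `scores` the model's predicted probabilities for the positive class.
y_mortality = pd.Series(["Deceased", "Survivor", "Deceased", "Survivor"])
scores = np.array([0.9, 0.2, 0.6, 0.4])

# "Deceased" is treated as the positive class, matching "extreme".
y_test = y_mortality.map({"Deceased": 1, "Survivor": 0})
print("Test AUC:", roc_auc_score(y_test, scores))
```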

RNA-Seq machinelearning
4.4 years ago
Mensur Dlakic ★ 27k

But there is no other dataset with the same set of labels as described above.

I am puzzled as to why you would train on a dataset that is one of a kind. Even if this is just an exercise, that seems like an odd choice. If choosing a different dataset is an option, that would be my suggestion.

I am sure you know that an AUC of 0.5 indicates random chance. Since that is actually difficult to get even with a poor classifier, I suspect that your train and test data do not match. The mismatch can be in the data type, the labels, the features, or all of the above.
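As a quick sanity check (synthetic numbers, purely to illustrate the point), scores that carry no information about the labels land right around 0.5:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)   # true labels
scores = rng.random(size=10_000)      # scores unrelated to the labels
print(roc_auc_score(y, scores))       # prints ~0.50
```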

If you have to stick with this dataset, I recommend that you set aside 20-25% of the data as a validation set and do 5-fold or 10-fold cross-validation on the rest. That way you will get an estimate from your training that can be validated on the same kind of data. You have to use stratified sampling for both the train and validation splits so that the ratio of labels is preserved; see the sketch below.
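A minimal sketch of that setup with scikit-learn, using a synthetic matrix in place of your expression data (the sizes, label ratio, and choice of classifier are all illustrative, not specific to your study):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in: 200 samples x 500 features with imbalanced labels.
X, y = make_classification(n_samples=200, n_features=500,
                           weights=[0.7, 0.3], random_state=42)

# Hold out 25% as validation; stratify so the label ratio is preserved.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# 5-fold stratified cross-validation on the remaining 75%.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print("CV AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))

# Final check on the held-out validation set.
model.fit(X_train, y_train)
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print("Validation AUC: %.3f" % val_auc)
```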

4.4 years ago

That is a fairly common microarray platform (for the training set), so I wonder if there might be some formatting issue with the data entry.

Yes, if you are getting an essentially random AUC on an independent dataset, then machine learning may not be the best option for your analysis.

Likewise, if you have a high AUC on one dataset but have done something to violate the independence of the samples (such as upstream feature selection, normalization that touches all of the samples, or literally not having an independent test set), then machine learning may also not be a great strategy, even though it may look artificially good in an initial publication. One way to avoid the feature-selection form of this leakage is sketched below.
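A minimal sketch of leakage-free feature selection, assuming a scikit-learn workflow on synthetic data (SelectKBest, k=50, and the data sizes are illustrative choices, not your actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an expression matrix: many features, few samples.
X, y = make_classification(n_samples=150, n_features=2000, n_informative=10,
                           random_state=0)

# The selector is refit inside each training fold, so the held-out fold
# never influences which features are chosen.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print("Leakage-free CV AUC: %.3f" % auc.mean())
```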

While it sounds like there may be either a coding issue or a problem with the processed data, I would still be cautious even if you got a result with an AUC close to 1.00: I would try to find more evidence that you are looking at something real and reproducible. So, I don't think you are currently describing a problem with "over-fitting", but I do have a plot illustrating that problem, where adding complexity actually makes validation performance worse:

http://cdwscience.blogspot.com/2019/05/emphasizing-hypothesis-generation-in.html
