Machine learning procedure for cancer subtype discovery
3
2
8.1 years ago
juncheng ▴ 210

I have gene expression data from around 200 cancer patients and want to build a cancer subtype classifier.

What is the correct way to handle the data? Should I divide it into training, cross-validation, and test sets, then train the classifier on the training set, validate it on the cross-validation set, and test it on the test set? What is a good proportion for these sets?

machine-learning cancer-subtypes • 2.8k views
5
8.1 years ago

There are different ways to do this, but personally I think it is a good idea to set aside a random hold-out (= what you call test) sample (let's say 20%) before any training has been done in order to get a more unbiased performance estimate at the end of the process. Then you can work with the remaining data, dividing it into training and validation sets for tuning parameters in your model. This can be done with n-fold cross-validation, leave-one-out cross-validation etc. Only when you are happy with the model, evaluate it on the hold-out set. The caret R package is a nice tool for trying different algorithms, tuning parameters, preprocessing and dealing with cross-validation. It is useful to read the whole vignette.
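For what it's worth, the split described above can be sketched in base R roughly like this. The labels are simulated and the names (`labels`, `holdout_idx`, `train_idx`, `fold_id`) are made up for illustration; the 20% hold-out and 5 folds are just example choices, not a recommendation for any particular dataset.

```r
# Hypothetical example: 200 samples with three made-up subtype labels.
set.seed(42)
labels <- factor(sample(c("A", "B", "C"), 200, replace = TRUE,
                        prob = c(0.5, 0.3, 0.2)))

# Stratified 20% hold-out: draw 20% of the indices within each subtype.
holdout_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.2 * length(idx)))]
}), use.names = FALSE)

# Everything else is available for training and parameter tuning.
train_idx <- setdiff(seq_along(labels), holdout_idx)

# Randomly assign the remaining samples to 5 cross-validation folds.
n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = length(train_idx)))
```

The hold-out indices are set aside once and not touched until the model is final; `fold_id` says which fold each remaining sample belongs to during tuning.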

1

Good point. I agree it's best to leave out a true test set when you can, but with only 200 samples I wonder if that 20% would better serve to help improve training than for testing. I suppose the appropriate choice depends on the data set (how homogeneous it is, how many cancer subtypes, etc) and the context of the experiment (eg, is your goal to publish?).

0

I used PAM as implemented in the pamr package.

It seems that with this package everyone runs cross-validation on the same data used for training.

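library(pamr)
# x: expression matrix (genes in rows, samples in columns); y: subtype label per sample
train.dat <- list(x = dat, y = labels, genenames = gN, geneid = gI,
                  sampleid = sI)
# Training
model <- pamr.train(train.dat)
# Cross-validation, 10-fold, on the same data used for training
model.cv <- pamr.cv(model, train.dat, nfold = 10)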

4
8.1 years ago

With only 200 total samples I would be torn between (1) using the entire set for training along with n-fold or leave-one-out cross-validation vs (2) putting aside some number as an independent test set. It's possible that you won't have enough samples to allow both robust modeling during the training phase and accurate estimation of performance from the independent set. Therefore I might lean toward using the entire set for training if there are other published datasets that you can use as additional independent test sets. Are your cancer subtypes of approximately equal size in the cohort? If one subtype is much rarer and represented by only a small fraction of the total 200, then dividing into test/train might be even more challenging.
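To make the subtype-size question concrete, here is a rough base-R sketch that tabulates subtype counts and then builds stratified folds, so that a rare subtype still shows up in every fold. The labels are invented (one subtype is deliberately set to 10 of 200 samples), and `fold_id` and `fold_table` are illustrative names only.

```r
# Hypothetical labels: one subtype is deliberately rare (10 of 200).
labels <- factor(rep(c("A", "B", "rare"), times = c(120, 70, 10)))

# Subtype counts and proportions in the cohort.
print(table(labels))
print(round(prop.table(table(labels)), 2))

# Stratified folds: deal fold numbers out separately within each subtype.
set.seed(1)
n_folds <- 5
fold_id <- integer(length(labels))
for (idx in split(seq_along(labels), labels)) {
  fold_id[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Samples per subtype in each fold.
fold_table <- table(fold = fold_id, subtype = labels)
print(fold_table)
```

With purely random folds, a subtype this small could by chance be missing from a fold altogether; stratifying avoids that, though it does not create more samples of the rare class.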

You might find this series of BioStar tutorials useful:

2
8.1 years ago
Katie D'Aco ★ 1.0k

If you know the cancer subtype for each patient, then use all 200 samples for training and use cross validation (again, using the entire dataset) to assess your model. There are some good R packages out there for this...can't think of any off the top of my head, but I've used a few that work well.

0

Hi, that is what I did before. I used PAM as implemented in the pamr package.

What I am worried about is possible overfitting if I use the entire dataset for both training and CV.

0

The results of the CV should give you an idea how much you're overfitting, but if you're worried about it then Mikael Huss' suggestion is a good one.