Question

How to partition samples into training and test sets?

5.2 years ago

feng1053581201 • 0

Hello everyone! I noticed that every time I randomly partition my expression matrix and its clinical data into training and validation sets at a 1:1 ratio, the lncRNAs selected after univariate Cox analysis, Lasso regression and multivariate Cox analysis are different. I know it is expected that different samples give different results, but I can't get past it. How should I choose proper training and test sets to run my analysis and reach a final conclusion?

Another issue is that every time I build a risk score model based on lncRNAs in the training set, it shows good prediction in the ROC and KM curves, but its predictive ability in the test set is poor. So I guess there must be some constraints between the two sets. For example, the ratio of alive to death events must be close, and the mean and variance of survival time should be similar. But is that still random partitioning? Or did I just get wrong results?

I hope you can give me some advice on the partitioning. Thank you very much!

RNA-Seq R • 2.3k views

ADD COMMENT • link updated 5.2 years ago by Mensur Dlakic ★ 28k • written 5.2 years ago by feng1053581201 • 0

1. The training:test ratio is usually 3:1 or 4:1, and you may use K-fold cross-validation to validate your model.
2. Your model may be overfitting. In general, there are many things you could try when tuning the model.

ADD REPLY • link 5.2 years ago by shoujun.gu ▴ 350

Thank you for your reply! For 1, I will give it a try. For 2, after univariate Cox analysis, I used Lasso regression and robust likelihood-based survival modeling (the rbsurv package in R) to further reduce the number of lncRNAs, and the lncRNAs in the intersection were chosen. So overfitting may not happen. Is that right?

ADD REPLY • link 5.2 years ago by feng1053581201 • 0

score 1 · Answer 1 · 2019-09-04

For many datasets random partitioning works fine. That's usually the case for large datasets with a balanced distribution of classes and a more or less uniform distribution of different types of data points. Think about it this way: if the data and class distributions in your train and test data are similar to each other, you are more likely to get meaningful predictive models that generalize well.

First, you want to make sure that your partitions are stratified with regard to class distribution. That means if your 0:1 ratio in the whole dataset is 1.34:1, you want to have a similar proportion of 0s and 1s in both training and test datasets. If your data has some kind of structure in it beyond different classes (temporal, or other general data categories), you'd want to have those different groups equally represented in all partitions.

One of the simplest ways to partition data manually is to divide the dataset by classes, sort numerically by features in each group, and then sample in consecutive fashion from the sorted data. If you want to split the data 50:50, put all odd data points from class 0 into the training group, and all even data points from the same class into the testing group. Repeat for each class and join the training and testing subgroups, and there's your dataset that is partitioned both by classes and by feature distributions.
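The manual procedure above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `interleaved_split`, the toy samples, and the choice of survival time as the single sort key are all hypothetical, and with real data you would pick whichever feature matters most for your model.

```python
from collections import defaultdict


def interleaved_split(samples, label_of, key_of):
    """Split samples 50:50 by alternating through each class after sorting.

    label_of and key_of are caller-supplied functions (hypothetical names)
    that return a sample's class label and its numeric sort key.
    """
    by_class = defaultdict(list)
    for s in samples:
        by_class[label_of(s)].append(s)

    train, test = [], []
    for label in sorted(by_class):
        ordered = sorted(by_class[label], key=key_of)
        train.extend(ordered[0::2])  # 1st, 3rd, 5th, ... (odd positions)
        test.extend(ordered[1::2])   # 2nd, 4th, 6th, ... (even positions)
    return train, test


# Hypothetical toy data: (sample_id, event_status, survival_days)
samples = [
    ("s1", 0, 310), ("s2", 1, 45), ("s3", 0, 120), ("s4", 1, 400),
    ("s5", 0, 800), ("s6", 1, 150), ("s7", 0, 560), ("s8", 1, 90),
]

train, test = interleaved_split(
    samples,
    label_of=lambda s: s[1],  # class = event status
    key_of=lambda s: s[2],    # sort key = survival time (an assumption)
)
```

Because every class is sorted and then dealt out alternately, both halves end up with the same class ratio and a similar spread of the sort key. The split is deterministic, so it will not change between runs.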

If you are into Python, sklearn's model_selection has tools to automate this process. Something similar can be done with caret in R, though I am not sure that's the best package available.
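As a rough sketch of the sklearn route, `train_test_split` accepts a `stratify` argument and `StratifiedKFold` keeps class proportions similar across folds. The data below is random and purely hypothetical (a 100 x 5 matrix standing in for an expression matrix, and a 0/1 label vector), and the 3:1 split and five folds are example settings, not recommendations.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical stand-in data: 100 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 57 + [1] * 43)  # roughly a 1.34:1 ratio of 0s to 1s

# Single stratified split, 75% training and 25% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stratified 5-fold cross-validation: every fold keeps a similar 0:1 ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_event_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
```

Fixing `random_state` makes the partition reproducible, and repeating the analysis over several folds gives a sense of how much the selected features depend on one particular split.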