Question: How to partition samples into training and test sets?
10 months ago, feng1053581201 wrote:

Hello everyone! I noticed that every time I randomly partition my expression matrix and its clinical data into training and validation sets at a 1:1 ratio, the lncRNAs selected after univariate Cox analysis, Lasso regression and multivariate Cox analysis are different. I know it is expected that different samples give different results, but I can't get past it. How should I choose training and test sets so that I can run my analysis and reach a final conclusion?

Another issue is that every time I build a risk score model based on lncRNAs in the training set, it predicts well according to the ROC and KM curves, but its predictive ability in the test set is poor. So I guess there must be some constraints linking the two sets. For example, the ratio of alive to death events must be close, and the mean and variance of survival time should be similar. But is that still a random partition? Or did I just get wrong results?
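One way to check the conditions guessed at above is to summarize each partition and compare the numbers side by side. The following is only a minimal sketch: the `summarize` helper and the toy `(survival_time, event)` pairs are hypothetical and not taken from the original analysis.

```python
from statistics import mean, variance

def summarize(samples):
    """Summarize one partition of (survival_time, event) pairs.

    event is assumed to be 1 for death and 0 for alive/censored.
    """
    times = [t for t, _ in samples]
    events = [e for _, e in samples]
    return {
        "n": len(samples),
        "event_fraction": sum(events) / len(events),
        "mean_time": mean(times),
        "var_time": variance(times),  # sample variance
    }

# Hypothetical toy partitions, for illustration only.
train = [(12.0, 1), (30.5, 0), (7.2, 1), (45.0, 0), (22.1, 1), (60.3, 0)]
test = [(10.4, 1), (33.0, 0), (8.9, 1), (41.7, 0), (25.6, 1), (58.0, 0)]

train_summary = summarize(train)
test_summary = summarize(test)
for key in train_summary:
    print(f"{key:15s} train={train_summary[key]:.3f} test={test_summary[key]:.3f}")
```

Large gaps between the two columns would suggest that the split itself, rather than the model, contributes to the poor test performance.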

I hope you can give me some advice on partitioning. Thank you very much!

rna-seq R • 1.0k views
  1. The training:test ratio is usually 3:1 or 4:1, and you can use k-fold cross-validation to validate your model.
  2. Your model may be overfitting. In general, there are many things you could try when tuning the model...
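As a rough illustration of point 1, k-fold cross-validation can be sketched in plain Python. The `k_fold_indices` helper is a hypothetical name, and the 20 samples and k=4 are arbitrary choices that happen to give a 3:1 training:test ratio in every fold.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=20, k=4))
for train_idx, test_idx in splits:
    # Feature selection and model fitting would use train_idx only;
    # test_idx is held out for evaluation in this fold.
    print(len(train_idx), len(test_idx))
```

Each sample is held out exactly once, so the performance averaged over folds depends less on any single random split.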
written 10 months ago by shoujun.gu

Thank you for your reply! For 1, I will give it a try. For 2, after univariate Cox analysis, I used Lasso regression and robust likelihood-based survival modeling (the rbsurv package in R) to further reduce the number of lncRNAs, and chose the lncRNAs in the intersection of the two. So overfitting may not happen. Is that right?

written 10 months ago by feng1053581201
10 months ago, Mensur Dlakic (USA) wrote:

For many datasets random partitioning works fine. That's usually the case for large datasets with a balanced distribution of classes and a more or less uniform distribution of different types of data points. Think about it this way: if the data and class distributions in your train and test data are similar to each other, you are more likely to get meaningful predictive models that generalize well.

First, you want to make sure that your partitions are stratified with regard to class distribution. That means if the 0:1 ratio in the whole dataset is 1.34:1, you want to have a similar proportion of 0s and 1s in both the training and test datasets. If your data has some kind of structure in it beyond different classes (temporal, or other general data categories), you'd want to have those different groups equally represented in all partitions.
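A stratified split of this kind can be sketched in a few lines of plain Python. The `stratified_split` helper and the 134:100 label counts are hypothetical, chosen only to reproduce the 1.34:1 ratio mentioned above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.5, seed=0):
    """Split sample indices so each class keeps its share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # random assignment within each class
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

def zero_to_one_ratio(labels, idxs):
    """Ratio of class-0 to class-1 samples among the given indices."""
    zeros = sum(1 for i in idxs if labels[i] == 0)
    return zeros / (len(idxs) - zeros)

# Hypothetical labels with a 1.34:1 ratio of 0s to 1s.
labels = [0] * 134 + [1] * 100
train_idx, test_idx = stratified_split(labels, test_fraction=0.5)
print(zero_to_one_ratio(labels, train_idx), zero_to_one_ratio(labels, test_idx))
```

The split stays random inside each class; only the class proportions are fixed.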

One of the simplest ways to partition data manually is to divide the dataset by classes, sort numerically by features in each group, and then sample in consecutive fashion from the sorted data. If you want to split the data 50:50, put all odd data points from class 0 into the training group, and all even data points from the same class into the testing group. Repeat for each class and join the training and testing subgroups, and there's your dataset that is partitioned both by classes and by feature distributions.
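The manual procedure can be sketched as follows, assuming a single numeric feature to sort by (for example, survival time). The `sorted_alternating_split` helper and the toy samples are hypothetical. Odd-numbered points of each sorted class go to training and even-numbered points go to testing.

```python
from collections import defaultdict

def sorted_alternating_split(samples):
    """Split (sample_id, class_label, feature_value) tuples 50:50.

    Within each class, samples are sorted by feature value, then the
    1st, 3rd, 5th, ... go to training and the 2nd, 4th, 6th, ... to testing.
    """
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    train, test = [], []
    for label in sorted(by_class):
        ordered = sorted(by_class[label], key=lambda s: s[2])
        train.extend(ordered[0::2])  # odd-numbered positions (1st, 3rd, ...)
        test.extend(ordered[1::2])   # even-numbered positions (2nd, 4th, ...)
    return train, test

# Hypothetical toy data: (sample_id, class_label, feature_value).
samples = [
    ("s1", 0, 5.0), ("s2", 0, 1.0), ("s3", 0, 3.0), ("s4", 0, 7.0),
    ("s5", 1, 2.0), ("s6", 1, 8.0), ("s7", 1, 4.0), ("s8", 1, 6.0),
]
train, test = sorted_alternating_split(samples)
print([s[0] for s in train])
print([s[0] for s in test])
```

This split is deterministic rather than random, which also makes it reproducible from run to run.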

If you are into Python, sklearn's model_selection has tools to automate this process. Something similar can be done with caret in R, though I am not sure that's the best package available.
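For the Python route, a stratified 50:50 split with `train_test_split` from `sklearn.model_selection` might look like the sketch below. It assumes scikit-learn is installed; the sample IDs, the status labels and the `random_state` value are made up for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical sample IDs and class labels (134 zeros, 100 ones).
sample_ids = [f"sample_{i}" for i in range(234)]
status = [0] * 134 + [1] * 100

# stratify=status keeps the 0:1 proportion similar in both partitions;
# random_state fixes the shuffle so the split is reproducible.
train_ids, test_ids, train_status, test_status = train_test_split(
    sample_ids, status, test_size=0.5, stratify=status, random_state=42
)

print(Counter(train_status))
print(Counter(test_status))
```

The same module also provides `StratifiedKFold` for stratified cross-validation folds.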


Thank you very much! So dividing groups by their classes is needed. I got it. I don't want to split manually, since there are almost 400 samples in my study. In my previous process, I used the createDataPartition function in the caret package to split. However, it can only partition groups by one kind of category, i.e., I can only stratify data by either vital status or survival time, not both. Therefore, when I stratify data by event, the means and variances of survival time are far from each other, even though the P-values from a t-test are almost 0.9. I will search for more information about caret. Otherwise, I will look into sklearn. Anyway, thank you!
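One possible workaround for stratifying on two variables at once is to combine them into a single composite label (event status plus a survival-time quantile bin) and stratify on that. This is only a sketch in plain Python, not something caret or sklearn is claimed to do this way; the helper names, the four time bins and the simulated 400-sample cohort are all assumptions.

```python
import random
from collections import defaultdict
from statistics import quantiles

def composite_strata(times, events, n_bins=4):
    """Label each sample with (event, survival-time quantile bin)."""
    cuts = quantiles(times, n=n_bins)  # n_bins - 1 cut points
    return [(e, sum(t > c for c in cuts)) for t, e in zip(times, events)]

def split_by_strata(strata, test_fraction=0.5, seed=0):
    """Randomly split sample indices within each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, stratum in enumerate(strata):
        groups[stratum].append(idx)
    train_idx, test_idx = [], []
    for stratum in sorted(groups):
        idxs = groups[stratum]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical simulated cohort of 400 samples, for illustration only.
rng = random.Random(1)
times = [rng.expovariate(1 / 36.0) for _ in range(400)]
events = [1 if rng.random() < 0.4 else 0 for _ in range(400)]

strata = composite_strata(times, events)
train_idx, test_idx = split_by_strata(strata)

def event_fraction(idxs):
    """Fraction of samples with event == 1 among the given indices."""
    return sum(events[i] for i in idxs) / len(idxs)

print(len(train_idx), len(test_idx))
print(round(event_fraction(train_idx), 3), round(event_fraction(test_idx), 3))
```

Because every (event, time bin) stratum is split evenly, both the event ratio and the rough shape of the survival-time distribution are kept similar across the two sets.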

written 10 months ago by feng1053581201