Hello everyone! Every time I randomly partition my expression matrix and its clinical data into a training set and a validation set at a 1:1 ratio, the lncRNAs that survive univariate Cox analysis, Lasso regression and multivariate Cox analysis are different. I know it is expected that different samples give different results, but I can't get past it. How should I choose the training and test sets so that I can run my analysis and reach a final conclusion?
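To make the instability concrete, here is a minimal sketch (in Python rather than R, standard library only) of how one could count how often each lncRNA is selected across many random 1:1 splits. `select_features` is a hypothetical placeholder for the univariate Cox, Lasso and multivariate Cox steps, and the toy selector and sample ids are made up for illustration.

```python
import random
from collections import Counter


def selection_frequency(sample_ids, select_features, n_repeats=100, seed=0):
    """Repeat a random 1:1 split and count how often each feature is kept.

    select_features is a hypothetical callback: it receives the training
    sample ids and returns the feature names that the screening pipeline
    keeps when run on those samples only.
    """
    rng = random.Random(seed)  # fixed seed so the repeats are reproducible
    counts = Counter()
    for _ in range(n_repeats):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        train = shuffled[: len(shuffled) // 2]  # first half is the training set
        counts.update(set(select_features(train)))
    # Fraction of repeats in which each feature was selected
    return {name: n / n_repeats for name, n in counts.items()}


def toy_selector(train_ids):
    """Made-up selector: 'lncA' is always kept, 'lncB' only if 's1' is in training."""
    kept = ["lncA"]
    if "s1" in train_ids:
        kept.append("lncB")
    return kept


ids = [f"s{i}" for i in range(20)]
freq = selection_frequency(ids, toy_selector, n_repeats=200, seed=1)
print(freq)
```

A feature that is kept in nearly every repeat is a more stable candidate than one that appears in only a handful of splits.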
Another issue is that every time I build a risk score model from the lncRNAs in the training set, it predicts well according to the ROC and KM curves, but its predictive ability in the test set is poor. So I guess there must be some constraints on the two sets. For example, the ratio of alive to death events should be close, and the mean and variance of survival time should be similar. But is that still a random partition? Or are my results simply wrong?
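A split that is random but still balanced on event status can be sketched as below: samples are shuffled within the alive and death groups separately, and each group is then halved. This is only a minimal illustration in Python (standard library only), assuming each sample is a `(sample_id, survival_time, event)` tuple with `event` equal to 1 for death and 0 otherwise; the toy data are made up.

```python
import random
from statistics import mean


def stratified_split(samples, seed=0):
    """Random 1:1 split that keeps the event ratio similar in both halves.

    samples: list of (sample_id, survival_time, event), event 1 = death, 0 = alive.
    """
    rng = random.Random(seed)
    train, test = [], []
    for status in (0, 1):
        group = [s for s in samples if s[2] == status]
        rng.shuffle(group)  # random within each event group
        half = len(group) // 2
        train.extend(group[:half])
        test.extend(group[half:])
    return train, test


def summarize(part):
    """Size, event rate and mean survival time of one partition."""
    return {
        "n": len(part),
        "event_rate": mean(s[2] for s in part),
        "mean_time": mean(s[1] for s in part),
    }


# Made-up data: 40 samples, 16 deaths, survival times 10..49
samples = [(f"s{i}", 10.0 + i, 1 if i % 5 < 2 else 0) for i in range(40)]
train, test = stratified_split(samples, seed=42)
print(summarize(train))
print(summarize(test))
```

The event rate is matched by construction, while survival time is only checked afterwards with `summarize`, so the partition stays random within each group.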
I hope you can give me some advice on the partition. Thank you very much!
Thank you for your reply! For 1, I will give it a try. For 2, after univariate Cox analysis, I used Lasso regression and robust likelihood-based survival modeling (the rbsurv package in R) to further reduce the number of lncRNAs, and kept only the lncRNAs selected by both methods. So overfitting may not be a problem. Is that right?