I have an RNA-seq dataset of 160 samples (80 from patients with a certain disease and 80 from healthy individuals). I want to use a small set of genes to predict disease status and would therefore like to use lasso regression. I am also aware that my dataset is relatively small, so I would like to use cross-validation to test the model built from the genes selected by lasso regression. However, I am not sure how to do this. Would the following be a good way to use lasso regression and k-fold CV to predict disease status?
create a training and validation split (70:30)
perform lasso regression to select the best gene combination using the training set only
remove all the gene/gene counts which are not part of the best gene combination in the test set
use all the remaining genes as variables in my test set to calculate the predictive value of my gene combination with k-fold cross-validation
You have several links to similar questions on the right side of this page.
On a small dataset such as yours, I don't think that a single train/validation split will do. A preferred way is to run the Lasso with cross-validation (CV), which tests many values of the alpha parameter (I believe the R implementation calls it lambda) and finds the one that is optimal. A Python implementation of that procedure is available.
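As a minimal sketch of that procedure, assuming scikit-learn's `LassoCV` is the kind of implementation meant here: the data below are synthetic stand-ins (160 samples, 500 genes, the first five made informative), and all names and sizes are assumptions for illustration, not your actual data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix: 160 samples x 500 genes (assumed sizes).
n_samples, n_genes = 160, 500
X = rng.normal(size=(n_samples, n_genes))
y = np.repeat([0, 1], n_samples // 2)  # 80 healthy (0), 80 disease (1)
X[y == 1, :5] += 1.5                   # make the first 5 genes informative

# Put genes on a common scale so the L1 penalty treats them equally.
X_scaled = StandardScaler().fit_transform(X)

# 10 stratified folds, so every fold contains both classes.
folds = list(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X_scaled, y)
)

# LassoCV fits the Lasso over a grid of alphas and keeps the one
# with the lowest mean cross-validated squared error.
lasso = LassoCV(cv=folds, max_iter=10000).fit(X_scaled, y)

best_alpha = lasso.alpha_
n_selected = int(np.count_nonzero(lasso.coef_))
print(f"best alpha: {best_alpha:.4f}, genes with non-zero coefficient: {n_selected}")
```

Note that `LassoCV` fits a linear model to the 0/1 labels. For a binary outcome, an L1-penalized logistic regression is a common alternative, but that is a swap from what is described above.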
This will apply a range of regularization strengths: a larger alpha/lambda means more regularization, and a smaller alpha/lambda means less. A larger alpha shrinks the coefficients of more features (genes, in your case) to zero, thus eliminating a larger number of genes. That could lead to underfitting. A smaller alpha shrinks fewer coefficients to zero, which means more genes are retained. That could lead to overfitting if meaningless genes are included. In short, the CV procedure finds the alpha that balances the two. I suggest at least 10-fold CV, and even 20-fold might be needed. Once the best alpha/lambda is found, you can print the feature coefficients. Genes whose coefficients are exactly zero can be excluded from the model.
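To see how alpha controls the number of retained genes, here is a small sketch assuming scikit-learn's `Lasso`. The data are synthetic, and the gene names (`gene_0`, `gene_1`, ...) and the alpha values are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in data: 160 samples x 200 genes (assumed sizes and names).
n_samples, n_genes = 160, 200
gene_names = np.array([f"gene_{i}" for i in range(n_genes)])
X = rng.normal(size=(n_samples, n_genes))
y = np.repeat([0.0, 1.0], n_samples // 2)  # 0 = healthy, 1 = disease
X[y == 1, :5] += 1.5                       # first 5 genes carry signal
X = StandardScaler().fit_transform(X)

alphas = [0.001, 0.01, 0.1, 0.5]  # from weak to strong regularization
n_retained = []
for alpha in alphas:
    fit = Lasso(alpha=alpha, max_iter=50000).fit(X, y)
    kept = gene_names[fit.coef_ != 0]  # genes with non-zero coefficients
    n_retained.append(len(kept))
    print(f"alpha={alpha}: {len(kept)} genes retained")
```

The count of retained genes falls as alpha grows, and at the largest alpha here every coefficient is zero, which is the underfitting extreme. In practice you would take the genes with non-zero coefficients at the CV-selected alpha rather than at a hand-picked one.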