Lasso regression and cross-validation on RNA seq dataset
1
1
Entering edit mode
5 weeks ago
Jipjop ▴ 10

Hi,

I'm trying to perform lasso regression and cross-validation on a RNA seq dataset to create a combination of RNAs that could most accurately predict a disease status. However, I am not sure if what I'm doing is the best way and would like some advise if this is the best way to go forward. Also, I have a question on how to view the coefficients in my lasso regression model at the bottom of my text.

Some background information: I have 192 samples within two classes (healthy and disease) of my dataset. Therefore, I think that cross-validation would be more appropriate to evaluate my model than a train-test split. Also, I would like to use a fairly low amount of variables/genes to best predict disease status, so that is why I'm using lasso regression as a machine learning method.

To create the model I have used the caret package:

myControl <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE


To evaluate the performance of the model I have used the caret and glmnet package:

model <- train(
disease_status~ .,
my_dataset,
method = "glmnet",
trControl = myControl
)


By printing the output of model, I can find that the accuracy of the model is at an alpha of 1.0 (and lambda of 0.01334) gives an roc of 0.82. However, I don't know how to print the classifiers that are used by my lasso regression model to receive this result. Coef(model) just returns NULL. Can anyone help me with these questions: if lasso regression and CV is best for solving my problem and how to print lasso classifiers?

lasso regression rna-seq cross-validation • 354 views
2
Entering edit mode
5 weeks ago
Mensur Dlakic ★ 19k

It is not clear whether you have 192 datasets or 192 features in some undefined number of datasets, but either way cross validation is a smart choice.

Your approach to feature selection is sound in general, but the alpha value may be too high. Don't know if you arrived at that via cross-validation from a large list of alphas or you picked it yourself. Either way, alpha=1.0 will shrink many feature coefficients, likely more than half of them. Maybe that's exactly what you want, but it may lead to underfitting as it will be dealing with too few features. That usually leads to a simplified model.

As to whether lasso is the best way, I'd say it is one of many ways. Tough to know exactly which one will work best without trying. Just about any tree-based classification method should work, and they tend to be better than lasso as long as one is careful not to overfit. They natively perform feature selection, though not necessarily by shrinking any coefficients to zero. You will still get relative feature ranking, and can select a desired number of features from the list. See below for a feature ranking example which is extracted from a gradient boosting classifier.

Finally, searching for feature selection on this site will likely give you more ideas. Here are several discussions where I contributed, so this is a self-promotion:

0
Entering edit mode

Thanks for answering! One question about CV: do I have to make a train (70% of all samples) - test split (30% of all samples) and perform CV and a machine learning method on the train set and evaluate performance on the test set? Or do I have to perform everything on the original dataset and calculate the predicting accuracy of the model on the original dataset including all the samples?

0
Entering edit mode

Yes, the idea is to perform CV on a train dataset and evaluate on test. For this to be reliable, some kind of a stratified split is needed so the datasets have the same proportion of classification categories, and ideally the same distributions.

You don't have to use the whole data for training, and generally it is only done when the dataset is very small and one can't afford to set aside any data for validation. In such cases leave-one-out CV is often performed instead of N-fold CV. Generally speaking, with smaller datasets higher fold number (10+) tends to help with generalization, while with larger datasets 3- or 5-fold CV is usually sufficient.

There is no hard rule, but your intended split (70:30) is unusual. I think most people go for 75:25 or 80:20 splits.

I noticed that glmnet may be calling its lasso regularization factor lambda, while in python implementation that factor is called alpha. If so, the regularization value you got (0.01334) is probably fine.

It seems that your command gave NULL because you didn't specify for which lambda the coefficients should be shown. You may find this page helpful in how to print the lasso coefficients.

Finally, if an answer was helpful and/or solved your problem, it is customary to upvote and/or accept it.