Question

Set a fixed number of variables to use for lasso regression


3.2 years ago

bart ▴ 50

Hi,

I'm trying to use lasso regression to build a model for a classification problem (predicting disease status). My dataset contains >16000 variables (RNA transcripts from genes), but I'd like to find only 5 or 10 genes or so that best predict disease status. However, the code from this source does not let me set a fixed number of variables to use: http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/#compute-lasso-regression. I'm also not sure whether lasso regression can be used for this purpose. The code I'm currently using:

library(glmnet)
# Divide dataset into training and testing samples
traindata <- dataset[trainingsamples, ]
testdata <- dataset[-trainingsamples, ]
# Predictor matrix without the intercept column. The outcome column is assumed to be
# named patientgroup (the original post used both patient.group and patientgroup).
x <- model.matrix(patientgroup ~ ., traindata)[, -1]
y <- ifelse(traindata$patientgroup == "disease", 1, 0)
# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Find which variables are being used (nonzero coefficients at lambda.1se)
tmp_coeffs <- coef(cv.lasso, s = "lambda.1se")
data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient = tmp_coeffs@x)

Right now the model uses 42 genes to predict disease status, which gives good accuracy. However, does anyone know how to reduce the number of variables being used? Or do I have to use another machine learning strategy to do so?

Thanks!

learning lasso machine regression • 1.8k views

updated 3.2 years ago by Mensur Dlakic ★ 29k • written 3.2 years ago by bart ▴ 50

score 1 · Answer 1 · 2022-05-05

Lasso doesn't work with a prescribed number of features. It will shrink as many (or as few) feature coefficients down to zero as needed to get the best fit. However, as long as the features are on a comparable scale, the absolute value of a feature's coefficient is essentially its importance, so you can still select a given number of features with the highest absolute values. If feature 5 is multiplied by a coefficient of 0.5 while feature 7 is multiplied by 0.003, it should be pretty obvious that feature 5 contributes more to the final result. In the plot below, features at the very top (33, 65, 199) or bottom (217, 117, 91) contribute more to the result than those in the middle (209, 288, 276).
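The ranking idea above could be sketched roughly as follows. This is only an illustration: `top_k_features` is a hypothetical helper (not part of glmnet), and the gene names and coefficient values are made up. It assumes the coefficients arrive as a named numeric vector and that the predictors are on comparable scales, so that coefficient magnitudes can be compared.

```r
# Hypothetical helper (not part of glmnet): keep the k predictors with the
# largest absolute coefficients from a named numeric vector.
top_k_features <- function(coefs, k) {
  coefs <- coefs[names(coefs) != "(Intercept)"]  # the intercept is not a gene
  coefs <- coefs[coefs != 0]                     # drop what lasso already zeroed out
  ranked <- coefs[order(abs(coefs), decreasing = TRUE)]
  head(ranked, k)
}

# Toy coefficients with made-up gene names, for illustration only.
toy <- c("(Intercept)" = -1.2, geneA = 0.5, geneB = 0.003, geneC = -0.8, geneD = 0)
top2 <- top_k_features(toy, 2)
print(top2)
```

With the objects from the question, the input would presumably be `as.matrix(coef(cv.lasso, s = "lambda.1se"))[, 1]`. Because lasso coefficients are shrunk, it is probably worth refitting a model on just the selected genes and checking its accuracy on the held-out test samples.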

Out of curiosity, what is wrong with using 42 genes that work well for you? That sounds like a manageable number.

[Image: plot of feature coefficients referenced above; not available]
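A different route, not the one described in this answer, is to pick a larger penalty from the lasso path so that fewer coefficients stay nonzero. The sketch below assumes the `lambda` and `nzero` vectors that a `cv.glmnet` object carries (penalty values and the number of nonzero coefficients at each one). `lambda_for_at_most_k` is a hypothetical helper and the path values are made up.

```r
# Hypothetical helper: the smallest lambda on the path that keeps
# at least 1 and at most k predictors.
lambda_for_at_most_k <- function(lambda, nzero, k) {
  ok <- nzero > 0 & nzero <= k
  if (!any(ok)) stop("no lambda on the path keeps between 1 and k predictors")
  min(lambda[ok])
}

# Made-up path values for illustration only.
lambda <- c(0.50, 0.30, 0.20, 0.10, 0.05)
nzero  <- c(0, 3, 8, 15, 42)
s_k <- lambda_for_at_most_k(lambda, nzero, k = 10)
print(s_k)
```

With the question's object this would be called as `lambda_for_at_most_k(cv.lasso$lambda, cv.lasso$nzero, k = 10)`, and the result passed as `s` to `coef(cv.lasso, s = s_k)`. The cross-validated error at that lambda (see `cv.lasso$cvm`) may be noticeably worse than at `lambda.1se`, so it is worth comparing the two before committing to the smaller gene set.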