LASSO regression for gene selection and clinical variables
8 months ago
fabian ▴ 10

Hi,

I am working on a project comparing two groups (healthy vs. diseased, n = 187 in total) using the edgeR pipeline, which resulted in a list of 75 differentially expressed genes (DEGs). My aim is to build a predictive model for the diseased state using these 75 DEGs along with clinical data (age, sex, smoking, etc.).

To achieve this, I initially tried combining the DEGs and clinical variables in a single LASSO regression model. However, the clinical variables, despite being biologically relevant, were shrunk to zero coefficients.

To address this, I opted for a two-step approach:

  1. Use LASSO (with 10-fold cross-validation) on the training set to select the most informative genes.
  2. Incorporate the selected genes along with the clinical variables in a simple logistic regression model.

This is my current code:

library(glmnet)
library(dplyr)
library(tibble)

set.seed(123)

# Step 1: LASSO with 10-fold CV on the genes only (clinical variables excluded)
cv.lassoModel <- cv.glmnet(
  x = df.glm.train %>% dplyr::select(-all_of(c(clinical_variables, outcome))) %>% as.matrix(),
  y = df.glm.train[[outcome]],
  standardize = TRUE,
  alpha = 1,
  nfolds = 10,
  family = "binomial",
  parallel = TRUE)  # needs a registered backend, e.g. doParallel::registerDoParallel()

idealLambda <- cv.lassoModel$lambda.min

# keep genes with non-zero coefficients at lambda.min
co <- coef(cv.lassoModel, s = idealLambda)
nonzero.genes <- as.matrix(co) %>%
  data.frame() %>%
  rownames_to_column("Gene") %>%
  dplyr::filter(s1 != 0 & Gene != "(Intercept)")

# Step 2: unpenalized logistic regression on the selected genes plus the clinical variables
finalLasso <- glm(as.formula(paste0(outcome, " ~ .")),
  data = df.glm.test %>% dplyr::select(
    all_of(nonzero.genes$Gene),
    all_of(outcome),
    all_of(clinical_variables)),
  family = binomial(link = "logit"))

I have the following questions regarding this approach:

  1. Is my two-step approach appropriate? I understand that the shrinkage effect of the LASSO-selected genes is lost in the logistic regression. Is there a way to retain it while including the clinical variables?
  2. My DGE analysis included the full dataset. Should I have held out the test set before conducting the DGE analysis to avoid data leakage (e.g., estimating dispersions separately)?
  3. I plan to apply a random forest model for the same purpose and compare it with the LASSO approach. Would a similar workflow (splitting into training/test sets, selecting genes, then combining with clinical data) be valid for random forest?

Thank you!

LASSO feature classification selection edgeR

You can set the penalty factor of the clinical variables to 0 if you do not want shrinkage applied during LASSO.

penalty.factor:
Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Also, any penalty.factor that is set to inf is converted to an exclude, and then internally reset to 1. Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.

The penalty.factor argument allows users to apply separate penalty factors to each coefficient. This is very useful when we have prior knowledge or preference over the variables.

Pseudocode:

library(glmnet)
library(dplyr)

y <- df.glm.train[[outcome]]
x_df <- df.glm.train %>% select(-all_of(outcome))
x <- model.matrix(~ 0 + ., x_df)

# penalty factor 0 = no shrinkage; those variables are always kept in the model
pf <- ifelse(grepl("age|sex|smoking", colnames(x)), 0, 1)

fit <- glmnet(x, y, family = "binomial", alpha = 1, penalty.factor = pf)
8 months ago
Mensur Dlakic ★ 30k

There is no need for a 2-step process. Logistic regression (LR) with an L1 penalty is equivalent to doing Lasso regression first and using only the genes with non-zero coefficients. Similarly, LR with an L2 penalty is equivalent to using Ridge regression, where no coefficients will be exactly zero, but some of them can be very small. You can also do either Lasso or Ridge without LR, except that you wouldn't get true probabilities.

Using random forests (RFs), or any gradient boosted trees approach, eliminates the need for variable selection. Weights that are assigned to each gene will be calculated on the fly based on the relative importance when splitting the trees. That also would make the 2-step process unnecessary.
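
Below is a minimal sketch of that with the randomForest package (ranger would work the same way), reusing df.glm.train, df.glm.test and the outcome/clinical column names from your code; everything else here is illustrative rather than prescriptive:

library(randomForest)
library(dplyr)

set.seed(123)
rf <- randomForest(
  x = df.glm.train %>% select(-all_of(outcome)),  # all 75 genes plus clinical variables
  y = factor(df.glm.train[[outcome]]),
  ntree = 500,
  importance = TRUE)

# per-variable importance is computed during tree building -- no separate selection step
importance(rf)
varImpPlot(rf)

# predicted probabilities for the held-out samples
pred <- predict(rf, newdata = df.glm.test %>% select(-all_of(outcome)), type = "prob")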

You may need to do the DGE on a smaller dataset (training samples only) and set aside the remaining samples for validation. Everything needs to be cross-validated, and CV scores must be similar to the actual scores on your validation data.
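
For example, a minimal sketch of running the edgeR step on the training samples only; `counts` (a genes x samples matrix) and `group` are placeholder names, and the quasi-likelihood calls below stand in for whatever edgeR workflow you actually used:

library(edgeR)

set.seed(123)
train_idx <- sample(seq_along(group), size = floor(0.7 * length(group)))

# normalization, dispersion estimation and testing see only the training samples
dge    <- DGEList(counts = counts[, train_idx], group = group[train_idx])
dge    <- calcNormFactors(dge)
design <- model.matrix(~ group[train_idx])
dge    <- estimateDisp(dge, design)
fit    <- glmQLFit(dge, design)
qlf    <- glmQLFTest(fit, coef = 2)
degs   <- rownames(topTags(qlf, n = Inf, p.value = 0.05))

# the remaining samples are then used only to evaluate the final model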

I suggest you do LR with L1 and L2 penalties and see how they compare to each other. Then compare them with RFs. There should be some internal consistency in your results across all approaches, but RFs are likely to be slightly better. That said, there is value in choosing a simpler model that is less likely to overfit (LR) over a more complex one (RF). If RF is better by 2-3% and the cross-validation scores are consistent, it would be worth going with it; if RF is only marginally better (0.1-0.5%), I'd stick with the simpler model.
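
As a minimal sketch of that comparison with cv.glmnet, assuming `x` and `y` are built as in the penalty.factor reply above (genes plus clinical variables, binary outcome) and using AUC as one possible CV metric:

library(glmnet)

set.seed(123)
cv_l1 <- cv.glmnet(x, y, family = "binomial", alpha = 1, type.measure = "auc")  # L1 / LASSO
cv_l2 <- cv.glmnet(x, y, family = "binomial", alpha = 0, type.measure = "auc")  # L2 / ridge

# cross-validated AUC at the best lambda for each penalty
cv_l1$cvm[cv_l1$lambda == cv_l1$lambda.min]
cv_l2$cvm[cv_l2$lambda == cv_l2$lambda.min]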
