Hi,
I am working on a project comparing two groups (healthy vs. diseased; n = 187 in total) using the edgeR pipeline, which yielded a list of 75 differentially expressed genes (DEGs). My aim is to build a predictive model for the diseased state using these 75 DEGs along with clinical data (age, sex, smoking, etc.).
To achieve this, I initially tried combining the DEGs and clinical variables in a single LASSO regression model. However, the coefficients of the clinical variables, despite these variables being biologically relevant, were shrunk to zero.
To address this, I opted for a two-step approach:
- Use LASSO (with 10-fold cross-validation) on the training set to select the most informative genes.
- Incorporate the selected genes along with the clinical variables in a simple logistic regression model.
This is my current code:
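library(glmnet)
library(dplyr)   # select(), filter(), %>%
library(tibble)  # rownames_to_column()

set.seed(123)
# Step 1: LASSO with 10-fold CV on the gene columns of the training set.
# all_of() is needed because clinical_variables and outcome are used as column names stored in variables.
cv.lassoModel <- cv.glmnet(
  x = df.glm.train %>% dplyr::select(-all_of(c(clinical_variables, outcome))) %>% as.matrix(),
  y = df.glm.train[[outcome]],
  standardize = TRUE,
  alpha = 1,
  nfolds = 10,
  family = "binomial",
  parallel = TRUE)  # needs a registered foreach backend (e.g. doParallel)

# Keep the genes with non-zero coefficients at the CV-optimal lambda.
idealLambda <- cv.lassoModel$lambda.min
co <- coef(cv.lassoModel, s = idealLambda, exact = TRUE)
nonzero.genes <- as.matrix(co) %>% data.frame() %>%
  rownames_to_column("Gene") %>%
  dplyr::filter(s1 != 0 & Gene != "(Intercept)")

# Step 2: unpenalized logistic regression on the selected genes plus the clinical variables.
finalLasso <- glm(as.formula(paste0(outcome, " ~ .")),
  data = df.glm.test %>% dplyr::select(
    all_of(c(nonzero.genes$Gene, outcome, clinical_variables))
  ),
  family = binomial(link = "logit"))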
I have the following questions regarding this approach:
- Is my two-step approach appropriate? I understand that the shrinkage effect of the LASSO-selected genes is lost in the logistic regression. Is there a way to retain it while including the clinical variables?
- My DGE analysis included the full dataset. Should I have held out the test set before conducting the DGE analysis to avoid data leakage (estimating dispersion separately, etc.)?
- I plan to apply a random forest model for the same purpose and compare it with the LASSO approach. Would a similar workflow (splitting into training/test sets, selecting genes, then combining with clinical data) be valid for random forest?
Thank you!
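You can set the penalty factor of the clinical variables to 0 (the penalty.factor argument of glmnet and cv.glmnet) if you do not want shrinkage applied to them during LASSO. The genes are then still penalized and selected, while the clinical variables always stay in the model.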
Pseudocode:
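Something along these lines, as a minimal sketch on simulated data. The matrices genes and clin, the variable names, and the effect sizes are all made up for illustration, so substitute your own training matrix. The only glmnet-specific piece is penalty.factor, a per-column multiplier on the penalty, where 0 means that column is never shrunk.

```r
library(glmnet)

set.seed(123)
n <- 187
n_genes <- 75

# Simulated stand-ins for the real data (assumption: not the poster's data).
genes <- matrix(rnorm(n * n_genes), nrow = n,
                dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
clin <- cbind(age     = rnorm(n, mean = 50, sd = 10),
              sex     = rbinom(n, 1, 0.5),
              smoking = rbinom(n, 1, 0.3))
x <- cbind(genes, clin)

# Outcome depends on two genes and weakly on age (arbitrary choice).
lin_pred <- 0.8 * genes[, 1] - 0.8 * genes[, 2] + 0.03 * (clin[, "age"] - 50)
y <- rbinom(n, 1, plogis(lin_pred))

# One penalty multiplier per column of x:
# 1 = usual LASSO penalty (genes), 0 = no penalty (clinical variables).
pf <- ifelse(colnames(x) %in% colnames(clin), 0, 1)

cv_fit <- cv.glmnet(x, y,
                    family = "binomial",
                    alpha = 1,
                    nfolds = 10,
                    penalty.factor = pf)

# Coefficients at the CV-optimal lambda; clinical variables are unpenalized,
# genes are shrunk and possibly set to zero.
co <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(co)[co[, 1] != 0], "(Intercept)")
print(selected)
```

With this setup the gene coefficients keep their shrinkage and the clinical variables are estimated alongside them in one model, so a separate glm refit is not required. Predictions for held-out samples would come from predict(cv_fit, newx = ..., s = "lambda.min", type = "response").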