Hi Biostars!

I have run into an issue with a predictive model for gene expression data that I am trying to construct. The model is built for binomial data: I use the filtered DEG results as the input matrix, with the corresponding phenotype as the response vector. The genes are then reduced further through cross-validated lasso regression with the glmnet package (alpha=1, nfolds=10), and the final model genes are those with non-zero coefficients at lambda.1se. The issue I am running into is that the selected "best" model is often still too complex: the predictors end up with Pr(>|z|) close or equal to 1.0, and the AUC for the model equals 1.0. Reducing the number of predictors seems to solve this, but I am unsure of the correct way to do so.
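
For anyone unfamiliar with lambda.1se, here is a minimal sketch of the one-standard-error rule as I understand it: take the largest penalty whose mean cross-validated error is within one standard error of the minimum. It is plain Python for illustration only, not glmnet's own code. The function name `lambda_1se` and all the numbers are made up.

```python
def lambda_1se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas : penalty values that were tried
    cv_mean : mean cross-validated error (e.g. binomial deviance) per lambda
    cv_se   : standard error of that mean per lambda
    """
    # Index of the penalty with the lowest mean CV error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Errors up to one standard error above the minimum count as "just as good".
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Largest such penalty, i.e. the sparsest model that is still within one SE.
    lam_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[i_min], lam_1se


# Made-up illustrative values, not real results.
lambdas = [0.50, 0.20, 0.10, 0.05, 0.01]
cv_mean = [1.30, 0.95, 0.80, 0.78, 0.90]
cv_se = [0.06, 0.05, 0.05, 0.04, 0.06]

lam_min, lam_1se = lambda_1se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

With these toy numbers the minimum error is at 0.05, while the rule picks the larger penalty 0.10, which would keep fewer genes.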

I have considered performing stepwise regression based on AIC on the final model genes after the cross-validated lasso regression, or simply keeping the predictors that lie closest to the glm regression line and removing the rest until Pr(>|z|) < 0.05 for the remaining predictors. However, as I am new to predictive modeling, I am not sure whether either of these approaches is valid from a statistical point of view.

Any and all input regarding this is highly appreciated.