Question

Approach to building regression models or classifiers with high number of parameters but small sample size

3.1 years ago

Jeffrey3555054 ▴ 20

Hi everyone. As with the output of most next-generation sequencing technologies, I have a large number of parameters (in my case, scores computed for a large number of genes; they are not exactly differential expression values, but they are still numerical variables) yet only a small sample size. I'm rather new to this area of statistics, and I'm facing the challenge of how to aggregate this large volume of data to make clinically meaningful inferences/predictions.

I now have parameters in the thousands (~2000) and only 25 samples. The end goal is to build a model with fewer parameters (after parameter selection) that predicts nodal staging in cancer; the outcome can be treated either as a continuous variable (e.g. stage 0 - stage 3) or as a discrete variable (e.g. > stage 2 or not). From what I gather after browsing various threads here and doing a bit of research, some methods for approaching this problem are:

1) Univariate regression for each parameter first then selecting significant parameters to put into a multivariate regression model (from Performing univariate and multivariate logistic regression in gene expression data)

2) Stepwise linear/logistic regression (from Building a predictive model by a list of genes) which probably is less tedious compared to manually running all the univariate regressions

3) Lasso or elastic-net regression (from How to exclude some of breast cancer subtypes just by looking at gene expression?) to perform parameter selection as well as model fitting

4) Random forest regression

So my main questions now are:

1) As I'm unfamiliar with the underlying math/statistics, is there any guide or rule of thumb on which approach is preferable, or are there any conditions that can help decide which approach should be used?

2) For lasso regression/random forest, the tutorials I read usually have a training set and a testing set. Given my low sample size, can I put all observations into the training set, or must I still hold out a few observations as the testing set?

3) For lasso regression, how do I optimize the alpha parameter using the glmnet package (most tutorials only mention how to optimize the lambda parameter)?

4) Am I right that random forest cannot perform parameter selection to reduce the number of explanatory variables (that is, it keeps all the parameters I input and infers missing values when necessary, but won't remove irrelevant parameters the way lasso regression would)?

cancer machine-learning R scRNA-seq RNA-seq • 778 views

updated 3.1 years ago by Mensur Dlakic ★ 28k • written 3.1 years ago by Jeffrey3555054 ▴ 20

score 0 · Answer 1 · 2021-08-25

Bad news first: what you want can't be done well. If you are doing this to learn the process, then it doesn't matter what kind of data you have. But if you are doing this to make a clinically relevant model (or for research), your data is not sufficient.

You have what is commonly known as an underdetermined system, which in plain terms means that you have too many variables (in your case, genes) and not enough equations (in your case, samples). These kinds of systems either don't have a solution (which is actually not bad), or have an infinite number of solutions (which is bad because it leads to overfitting).
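To make the "infinite number of solutions" point concrete, here is a minimal sketch with NumPy. It assumes purely synthetic data (random features and a random target, so there is no real signal to find), sized like the question: 25 samples and 2000 variables. All variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 25, 2000

# Synthetic stand-in for the gene scores: pure noise, no real signal.
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)  # target is unrelated to X by construction

# Minimum-norm least-squares solution of the underdetermined system X @ coef = y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.max(np.abs(X @ coef - y))

# Build a second, different solution by adding a vector from the null space of X
# (a random vector with its row-space component projected out).
v = rng.normal(size=n_features)
v_null = v - np.linalg.pinv(X) @ (X @ v)
coef_alt = coef + v_null
alt_residual = np.max(np.abs(X @ coef_alt - y))

print(f"max training error, solution 1: {train_residual:.2e}")
print(f"max training error, solution 2: {alt_residual:.2e}")
print(f"solutions identical: {np.allclose(coef, coef_alt)}")
```

Both coefficient vectors reproduce the training targets essentially exactly even though the target is noise, which is what overfitting looks like in this setting: a perfect fit on 25 samples says nothing about how the model behaves on new samples.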

There are two ways out of this predicament: get more samples (in your case, a lot more), or reduce the number of variables (which seems to be your choice). Now, reducing 2000 genes to 1000 or 500 would not be a problem, but you need to get them down to 10 or even fewer. If it were that easy to find only 10 genes responsible for cancer progression (or the lack of it), someone would have done it already.

Last piece of advice: complex models (random forests would qualify) tend to overfit terribly on underdetermined problems. Your only chance of avoiding overfitting, and not a great one given your particular setup, is to model this using simple, linear methods. Lasso would work because it uses L1 regularization, which squeezes many regression coefficients down to zero and effectively removes many variables. Still, I'm not sure that even lasso can reliably eliminate as many variables as your data demands. If you still want to give it a try, a Python solution is to run lasso in cross-validation mode, which will also find the optimal value of alpha by fitting across a range of values. Note that alpha in scikit-learn is the regularization strength, which glmnet calls lambda; glmnet's alpha (the lasso/ridge mixing parameter) corresponds to l1_ratio in scikit-learn's ElasticNet. If you decide to try it, I suggest setting the number of folds equal to your number of samples, which makes it a leave-one-out cross-validation (LOOCV).

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
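As a rough sketch of what that could look like, the snippet below runs scikit-learn's LassoCV with a LeaveOneOut splitter. The data are synthetic and the assumption that only the first three features carry signal is made up for illustration; with real gene scores you would load your own matrix and would likely want to standardize the features first.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
n_samples, n_features = 25, 2000

# Synthetic placeholder data (assumption): 2000 noise features, of which only
# the first three actually influence the target.
X = rng.normal(size=(n_samples, n_features))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n_samples)

# cv=LeaveOneOut() gives as many folds as samples (LOOCV).
# LassoCV fits a path of alpha values and keeps the one with the lowest
# cross-validated mean squared error.
lasso = LassoCV(cv=LeaveOneOut(), max_iter=10000)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # indices of features with non-zero coefficients
n_selected = selected.size

print(f"chosen alpha (glmnet's lambda): {lasso.alpha_:.4f}")
print(f"features kept: {n_selected} of {n_features}")
```

Even on synthetic data where the true signal is known, the set of selected features can include noise features or miss real ones at this sample size, so the selection should be read as a hint rather than a result.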

One last time: I want to stress that most likely you don't have enough data to make a reliable model no matter what kind of data wrangling is employed.