Hi everyone, I have recently moved to a clinical setting in a hospital, and as my first assignment I am working with microarray data (Affymetrix) from cancer patients. I am facing a big challenge: how do I adapt my experience in basic research to clinical application?
Like a lot of people here, I presume, I am quite familiar with using limma to fit linear models to microarray and RNA-seq data for finding differentially expressed genes (DEGs), and the design matrix is quite straightforward there. However, I have now been introduced to the methodology used in our clinical setting, and this is the pipeline they use for discovering DEGs (my rough attempt to translate it into R follows the list):
1. Univariate logistic regression (P < 0.05)
    - to find DEGs associated with the cancer type
    - less stringent, to allow more genes to be discovered
2. Multivariate logistic regression (P < 0.01 or less)
    - performed only on those genes that passed the first analysis
    - more stringent, to validate what was found in the first step
3. Multivariate analysis (P < 0.05 or less)
    - performed only on those genes that passed the second analysis
    - multivariate de facto: also considering other covariates such as biomarker levels (e.g. CA-125), age and weight
4. ROC/AUC curves
    - performed only on those genes that passed analysis 3
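To make sure I am reading the protocol correctly, here is my rough attempt to translate steps 1-3 into plain R. The objects expr, group and covars are placeholder names of my own for the normalised expression matrix, the normal/cancer outcome and the clinical covariates; please correct me if this is not what the pipeline actually means.

```r
## Placeholders: 'expr' is a genes-x-samples matrix of rma()-normalised values,
## 'group' is a binary factor (normal vs. cancer) for the same samples,
## 'covars' is a data.frame of clinical covariates (CA-125, age, weight).

## Step 1: univariate logistic regression, one gene at a time
uni_p <- apply(expr, 1, function(x) {
  fit <- glm(group ~ x, family = binomial(link = "logit"))
  summary(fit)$coefficients[2, 4]              # Wald p-value for the gene term
})
step1_genes <- names(uni_p)[uni_p < 0.05]

## Step 2: one multivariable model containing all genes that passed step 1
df2  <- data.frame(group = group, t(expr[step1_genes, , drop = FALSE]))
fit2 <- glm(group ~ ., data = df2, family = binomial(link = "logit"))
p2   <- summary(fit2)$coefficients[-1, 4]
step2_genes <- names(p2)[p2 < 0.01]

## Step 3: the same model, adjusted for the clinical covariates
df3  <- data.frame(df2[, c("group", step2_genes), drop = FALSE], covars)
fit3 <- glm(group ~ ., data = df3, family = binomial(link = "logit"))
summary(fit3)                                  # genes at P < 0.05 go on to ROC/AUC
```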
Quite frankly, I am a bit lost with all these steps (e.g. why not start from step 2 directly?), and consequently unsure which packages and functions to use in R. I have tried this thread, and this one, but I believe they do not answer my questions fully. So, any help in that regard (i.e. which tools to use, how to design the contrast matrix) would be extremely appreciated!
One last thing: for microarray data, would one use gene expression as the response variable and the trait of interest (i.e. normal vs. cancer) as the explanatory variable, or the other way around? Doesn't limma automatically set up the former case?
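For context, this is roughly how I have been setting things up in limma so far, with expression implicitly on the left-hand side, i.e. as the response. Here eset is a placeholder for my rma()-normalised ExpressionSet, and I have assumed group levels named normal and cancer.

```r
library(limma)

## 'eset' is a placeholder for the rma()-normalised ExpressionSet;
## 'group' is the normal/cancer factor for the same samples (assumed level names)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

fit  <- lmFit(eset, design)            # limma models expression as the response
cm   <- makeContrasts(cancer - normal, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cm))
topTable(fit2, coef = 1, adjust.method = "BH")
```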
Thanks a lot in advance!
Wow, this is just mesmerizing, Kevin! I immensely appreciate your help. In fact, it was more than help - it was a lecture. And that bonus at the end, for the sens. and spec., just topped it all off. Thank you so much for taking the time to address my questions so thoroughly. I will go over it, together with the links you added, just as carefully. And I hope it is OK to come back if there is something I get stuck on. This community never ceases to amaze me. A thousand thanks!
Sure thing - no problem.
I believe I am dealing with a complete or quasi-complete separation case. When trying to perform the multivariate analysis for each variable separately, following the example, I am getting this:
I have read some threads recommending bayesglm() or glmnet(). But first, I would like to know if there is anything I can do to optimize my data so that I can use your proposed "Building a predictive model by a list of genes"?
PS1: the same issue appears when I run the combined model
PS2: the data being used comes straight from rma() normalization
PS3: I added the "control" argument to the glm() function because I was getting the following warning message:
Any help is much appreciated again.
How many variables are in the glm()? bayesglm and glmnet (lasso-penalised regression) can work with a large number of variables - glm() cannot.
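As a rough illustration (x and y below are just placeholders for your predictor matrix and binary outcome), a lasso fit with glmnet would look something like this, and bayesglm() from the arm package can be used as a near drop-in replacement for glm():

```r
library(glmnet)   # lasso / elastic-net penalised regression
library(arm)      # bayesglm

## x: samples-x-genes numeric matrix (e.g. t() of your ~1,200-gene expression matrix)
## y: binary outcome (factor or 0/1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # alpha = 1 -> lasso
coef(cvfit, s = "lambda.min")    # genes with non-zero coefficients are the ones retained

## bayesglm uses weakly-informative priors, which also helps with
## (quasi-)complete separation
bfit <- bayesglm(y ~ ., data = data.frame(y = y, x), family = binomial)
```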
I believe that was exactly my problem. I started with around 1,200 variables (genes) and went down to 47 by cutting off by p-value. It worked in the latter case. Given that, would you recommend keeping my list of selected variables as low as possible, or would it be worth trying the other methods, bayesglm and glmnet? I know this is briefly commented on in this post. I am just afraid of losing some good predictors by excluding them to fit the glm() function. And many thanks again!
I think that it is worth trying various things, to be honest, and then comparing the AUC that you obtain for each. The key variables should be identified by each method.
When you get down to your 47 variables, by the way, that would be ideal for stepwise regression to reduce the set further. Stepwise regression will likely receive some frowns from the statisticians with whom you work, though. Just always check the models that it chooses and be acutely aware of the standard errors.
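For example, something along these lines (fit47 is just a placeholder for a glm() fitted on your 47 variables) lets step() search both directions by AIC:

```r
## 'fit47' is a placeholder: glm(group ~ ., family = binomial, data = ...) on the 47 genes
reduced <- step(fit47, direction = "both", trace = FALSE)
summary(reduced)   # always inspect the chosen model's coefficients and standard errors
```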
Also, if you get AUCs of 1, or AUCs with very wide confidence intervals, be very suspicious. The same is true for the confidence intervals from your models even before you generate ROC curves: if you see interval bounds like 0.000004 or 12342.77, then something is almost certainly wrong.
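pROC is one convenient way to obtain the AUC together with its confidence interval, so that you can sanity-check it as above (fit and group below are placeholders for your fitted model and the true outcome):

```r
library(pROC)

prob    <- predict(fit, type = "response")   # predicted probabilities from the model
roc_obj <- roc(group, prob)
auc(roc_obj)
ci.auc(roc_obj)    # an AUC of 1 or a very wide interval is a red flag
plot(roc_obj)
```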