Question

question about gene signatures

1

Entering edit mode

6.5 years ago

tujuchuanli ▴ 130

Hi,

Here I have a feature which was measured in patients from the BRCA dataset of TCGA (each patient has one value for this feature). It is a continuous variable. I want to find a set of genes measured by gene expression of RNA-seq which is highly correlated with this feature (so-called “gene signatures”). My plan is to use stepwise linear regression to find this set of genes.

My questions are:

Is it suitable to use stepwise linear regression here?
Could you please recommend some available package in R to do this job?
Do you have any other better choices to achieve my aim?

Thanks

gene signatures • 3.0k views

ADD COMMENT • link updated 6.5 years ago by Kevin Blighe 89k • written 6.5 years ago by tujuchuanli ▴ 130

score 3 · Answer 1 · 2019-01-03

3

Entering edit mode

6.5 years ago

Kevin Blighe 89k

I wrote some material in relation to this, here:

A: Resources for gene signature creation
What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?

Also, a tutorial on this by 2 former colleagues and I:

https://github.com/kevinblighe/Rtutorials (see Lecture 3 - RTrainingLect3.pptx)

Stepwise regression will not work for a large number of variables - it should be used for a few dozen or less variables. You can also cross-compare the results from forward, backward, and both of these combined stepwise regression. Usage of AIC or BIC must also be considered.

If you have more variables, you should think about a Bayesian regression model or penalised regression (lasso, ridge, elastic-net regression) (see A: How to exclude some of breast cancer subtypes just by looking at gene expressio ).

Finally, depending on to whom you talk, usage of stepwise regression will be criticised, but that is a converstaion to be debated by statisticians.

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

It is very nice of you to supply such precious materials to me. I have some questions after reading your post.

My response value is continuous value just like Chip-seq reads counts and my predictive values are expression of genes. Should I use “poisson” instead of “multinomial” in “cv.glmnet” function? (In the help page of “glmnet”, it recommend to use poisson when response is non-negative counts).
In the “Downstream” section of the post (https://www.biostars.org/p/287511/#287555), you built a new regression using the best predictors from “cv.glmnet”. But the best predictors as well as coefficients are already given by “cv.glmnet” (store in “co” in “Identify best predictors” section). Why do not you use these?
In my own analysis, I also plot variable deviances vs. shrinkage parameter, λ (lambda) as your “Perform 10-fold cross validation” section. But my plot is strange, the plot is below:

https://pan.baidu.com/s/1_oLWxTQDzrzoLz_V1oOrHA

Is it normal? What is the key of this plot? (Just like the scatterplot of predictor vs residuals in “lm” function which should look like stars in the sky)

I plot the scatterplot of response vs predicted response by using “points(modellingGlm$TB,as.vector(fitted(finalGlm)))”, the plot is here (https://pan.baidu.com/s/1t9_CGxTVY_l-XqijdBteSQ). The Pearson coefficient value is 0.97. The p-value of wald.test is almost 0. Can I say that the model is suitable with confidence?

ADD REPLY • link 6.5 years ago by tujuchuanli ▴ 130

0

Entering edit mode

My response value is continuous value just like Chip-seq reads counts and my predictive values are expression of genes. Should I use “poisson” instead of “multinomial” in “cv.glmnet” function? (In the help page of “glmnet”, it recommend to use poisson when response is non-negative counts).

Yes, you should use Poisson. The counts are normalised, correct?

In the “Downstream” section of the post (https://www.biostars.org/p/287511/#287555), you built a new regression using the best predictors from “cv.glmnet”. But the best predictors as well as coefficients are already given by “cv.glmnet” (store in “co” in “Identify best predictors” section). Why do not you use these?

The model in Downstream is constructed so that one can then perform ROC analysis, make model predictions, and derive p-values. Often, glmnet does not function too good, so, it is important to not use it as the sole method of identifying variables.

In my own analysis, I also plot variable deviances vs. shrinkage parameter, λ (lambda) as your “Perform 10-fold cross validation” section. But my plot is strange, the plot is below:

I cannot load the image(?)

Did you follow the example, here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#poi ?

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

Thank you for your replying, two additional questions:

I used the best predictors from “cv.glmnet” function to build the new model using “glm” function, then I summarized the regression using “summary” function. I found that not all predictors here are significant. Can I perform second round of stepwise regression based on the best predictors from “cv.glmnet” function?
I have check the url (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#poi). I basically follow this manual and your post. But it does not descript the detail of this figure. I update my figure of variable deviances vs. shrinkage parameter, λ (lambda) as below. Can you see it now?

enter image description here

ADD REPLY • link 6.5 years ago by tujuchuanli ▴ 130

0

Entering edit mode

I used the best predictors from “cv.glmnet” function to build the new model using “glm” function, then I summarized the regression using “summary” function. I found that not all predictors here are significant. Can I perform second round of stepwise regression based on the best predictors from “cv.glmnet” function?

Yes, I believe it helps to perform some extra step on the predictors from glmnet, like, for example, stepwise regression.

I have check the url (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#poi). I basically follow this manual and your post. But it does not descript the detail of this figure. I update my figure of variable deviances vs. shrinkage parameter, λ (lambda) as below. Can you see it now?

Yes, I can see it. Can you plot a histogram of your count data?

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Below is the histogram of my count data (response value):

It is generated by "hist" function

enter image description here

ADD REPLY • link 6.5 years ago by tujuchuanli ▴ 130

0

Entering edit mode

It is not a 'perfect' Poisson.. it has impurities / 'bumps'. What are the units of these?; how did you normalise the data? What is the range of values? Maybe you could try to shift the data by +1.

Do the predictors seem okay? I mean, are the results what you expected?

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

The response value is mutation load. It is calculated as the total number of miss-sense mutations in CDS region per patient. Here I want to explore the expression profile correlated to it. What I can say is that the distribution of this value is similar with previous works. It is hard to say how to do further normalization. In some previous works they furtherly calculated ratio between this counts and the coverage region of gene which does not matter in my mind.

The predictor value is log transformed TMM normalized RNA-seq reads counts.

ADD REPLY • link 6.5 years ago by tujuchuanli ▴ 130

0

Entering edit mode

How did you log-transform the RNA-seq counts? The deviance and error bars are just very big.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Do you mean that I should use TMM normalized RNA-seq counts without log-transformed as predictor value? In survival analysis it is recommanded to use log transformed TMM normalized RNA-seq counts. what is the difference?

ADD REPLY • link 6.5 years ago by tujuchuanli ▴ 130

0

Entering edit mode

Yes, logged data is better. However, the data in your histogram does not appear to be logged (?). Also, if you used the EdgeR program (?), you should produce logCPM counts and use that for predictors.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Yes, logged data is better. However, the data in your histogram does not appear to be logged (?).

I think there is a missleading. The histogram is response value, which is mutation load not expression of gene (the expression value is used as predictor). Do you mean I should try log-transformed mutation load as response? the histogram of log-transformed mutation load is below which seems to be normal distribution

Also, if you used the EdgeR program (?), you should produce logCPM counts and use that for predictors.

yes, I used EdgeR and produced logCPM

enter image description here

ADD REPLY • link 6.5 years ago by tujuchuanli ▴ 130

0

Entering edit mode

yes, I used EdgeR and produced logCPM

Great!

I think there is a missleading. The histogram is response value,

I see - thank you for confirming. You could, yes, log this response variable and change the family to Gaussian. I think that the error bars will be lower in that case.

Another thought: you could di- or tri-chotomise your response variable and switch to binomial / multinomial logistic regression with glmnet. I wrote more about this, here:

A: Why quantitative design are preferred GWAS approach

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k