Question

Finding the best combination of covariates in a multivariate linear regression

4.9 years ago

dodausp ▴ 180

Hi!

This post is a follow-up question to this previous thread: Exploring association between genes by their expression.

Briefly, say I have a list of candidate genes whose expression was shown (1) to be associated with overall survival (OS) (Cox regression), and (2) to be associated with one another (multivariate linear modeling). For example, high levels of gene A AND low levels of gene B AND low levels of gene C are associated with poor prognosis. So, if I am using only these 3 variables and a decent number of patients (e.g. n=200), it is not hard to find patients with that combination of expression levels.

However, if, for instance, I have a list of 8 such candidates, the chances of finding a patient who meets all 8 criteria are close to zero.

So my question is: is there a way to do some sort of permutation/combinatorial analysis coupled with Cox regression to find the combination of those 8 targets that best associates with OS, given that a satisfactory combination of factors should be represented in, say, at least 30% of the patient population?

Is there an R package with which one could accomplish that?

Any light shed on this is very much appreciated.

Thanks!

gene expression overall survival linear model R • 1.6k views


Related post at SO: How to choose the best combination of covariates in multivariate linear regression?

4.9 years ago by zx8754 11k

Are the numbers of genes / patients you mention as examples the ones you are actually working with? It is a bit hard to tell from your question. If so, I would use some machine learning to select features if needed (lasso) and to model OS from the explanatory variables (random tree).

4.9 years ago by Martombo ★ 3.1k

Hi, Martombo

Thanks for the feedback. You are right, maybe my question did not convey the whole idea I had in mind. I am putting down an example, so I hope I can get my message across better this time.

So, say I have a data frame with 196 rows (sample ID) by 32 columns (gene names). Included in that data frame there is also information about OS status and last follow-up date. Here is how it looks:

> shortlist[1:5,1:10]
           Age    PFS PFS_codex     OS OS_codex   gene_1   gene_2   gene_3   gene_4   gene_5
Sample_1  67.9 117.23         0 115.69        0 9.451046 5.572303 7.260597 8.492154 4.010582
Sample_2    61  69.27         0  72.30        1 9.520935 9.956700 8.370941 6.854242 4.638455
Sample_3  69.1   1.23         1   1.08        1 9.691664 8.713712 8.840432 7.891189 3.707268
Sample_4  72.2  15.27         1  69.63        1 9.490668 9.015255 8.601908 9.584230 4.277126
Sample_5  40.7  61.43         1  78.41        1 9.439942 7.769697 7.337121 7.222432 4.843211

This shortlist already contains only those candidates that passed a univariate Cox regression analysis. So now I run Cox again, except that this time all candidates from the shortlist are entered together:

multicox <- coxph(Surv(OS, OS_codex) ~ gene_1 + gene_2 + gene_3 + ... + gene_27, data=shortlist)
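Because the formula above is abbreviated with `...`, one way to avoid typing every gene is to build it from the column names. The following is only a minimal sketch on simulated data, assuming the gene columns share a `gene_` prefix as in the excerpt above; the data frame `toy`, its values, and the name `multicox_toy` are made up for illustration.

```r
library(survival)

set.seed(1)
n <- 196
# Hypothetical stand-in for `shortlist`: 5 simulated genes plus OS columns
toy <- data.frame(
  OS       = rexp(n, rate = 0.02),   # follow-up time (made up)
  OS_codex = rbinom(n, 1, 0.7),      # 1 = event, 0 = censored (made up)
  matrix(rnorm(n * 5, mean = 7), nrow = n,
         dimnames = list(NULL, paste0("gene_", 1:5)))
)

# Collect every column whose name starts with "gene_"
genes <- grep("^gene_", names(toy), value = TRUE)

# Paste the terms into "Surv(OS, OS_codex) ~ gene_1 + gene_2 + ..." and parse it
f <- as.formula(paste("Surv(OS, OS_codex) ~", paste(genes, collapse = " + ")))

multicox_toy <- coxph(f, data = toy)
summary(multicox_toy)$coefficients
```

With the real data, `toy` would be replaced by `shortlist`, and the `grep()` pattern would need to match however the gene columns are actually named.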

The results are the following (because of space, only those with significant p-values are listed):

> summary(multicox)
Call:
coxph(formula = Surv(OS, OS_codex) ~ gene_1 + gene_2 + gene_3 + ... + gene_27, data = shortlist)

  n= 196, number of events= 133 

            coef exp(coef) se(coef)      z Pr(>|z|)    
gene_1   0.80167   2.22926  0.22501  3.563 0.000367 ***
gene_8   0.15332   1.16570  0.06417  2.389 0.016888 *  
gene_9   0.76781   2.15505  0.24603  3.121 0.001803 ** 
gene_12 -0.84846   0.42807  0.30868 -2.749 0.005984 ** 
gene_18 -0.60773   0.54459  0.13789 -4.407 1.05e-05 ***
gene_26  0.35992   1.43321  0.14591  2.467 0.013634 *  
gene_27  0.41905   1.52052  0.14972  2.799 0.005127 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There are 7 candidates showing an association with OS (considering p<0.05). Based on that, and on the HR of each one of them (to know whether upregulation or downregulation is associated with OS), I would like to test the hypothesis that the combination of all these candidates is indeed a good predictor of poor prognosis.

For that, I would like to artificially create 2 groups of patients: 'high risk' and 'low risk'. The 'high risk' group contains the patients in whom every candidate is up- or down-regulated in the direction associated with worse OS. The 'low risk' group is the rest. (NB: here, I consider a gene up- or down-regulated when its value is above or below the median for that gene, respectively.)

First, I store the median values for all the 7 genes:

> a <- data.frame(gene_1=median(shortlist$gene_1), gene_8=median(shortlist$gene_8), gene_9=median(shortlist$gene_9), gene_12=median(shortlist$gene_12), gene_18=median(shortlist$gene_18), gene_26=median(shortlist$gene_26), gene_27=median(shortlist$gene_27))
> a <- as.numeric(a)
> a
[1] 8.999678 5.681134 5.907599 8.420542 6.158107 3.279144 7.020527

So the classification now comprises the following:

shortlist$risk_class <- ifelse(shortlist$gene_1>a[1] & shortlist$gene_8>a[2] & shortlist$gene_9>a[3] & shortlist$gene_12<a[4] & shortlist$gene_18<a[5] & shortlist$gene_26>a[6] & shortlist$gene_27>a[7], "high_risk", "low_risk")
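The same rule can be written more compactly by storing, for each gene, the direction that counts as risky. This is just a sketch on toy data with three genes: the data frame `toy`, the vector `risk_dir`, and the simulated values are hypothetical, and the directions are assumed to come from the sign of each Cox coefficient.

```r
set.seed(42)
# Toy expression data: 10 patients x 3 genes (values are made up)
toy <- data.frame(gene_1  = rnorm(10, 9),
                  gene_12 = rnorm(10, 8),
                  gene_18 = rnorm(10, 6))

# +1: above-median expression is the risky direction; -1: below-median is
risk_dir <- c(gene_1 = 1, gene_12 = -1, gene_18 = -1)

# Per-gene medians, in the same order as risk_dir
med <- sapply(toy[names(risk_dir)], median)

# Centre each gene on its median, then flip the sign for "-1" genes so that
# a positive value always means "on the risky side of the median"
centred <- sweep(as.matrix(toy[names(risk_dir)]), 2, med, "-")
risky   <- sweep(centred, 2, risk_dir, "*") > 0

# A patient is high risk only if every gene is on its risky side
toy$risk_class <- ifelse(rowSums(risky) == ncol(risky), "high_risk", "low_risk")
table(toy$risk_class)
```

`rowSums(risky)` also gives, per patient, how many of the conditions are met, which might be useful if the all-or-nothing rule turns out to be too strict.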

However, when I do that, only one patient fits those criteria:

> sum(shortlist$gene_1>a[1] & shortlist$gene_8>a[2] & shortlist$gene_9>a[3] & shortlist$gene_12<a[4] & shortlist$gene_18<a[5] & shortlist$gene_26>a[6] & shortlist$gene_27>a[7])
[1] 1
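A rough back-of-the-envelope check suggests that a single match is about what one would expect. Assuming, purely for illustration, that the genes are independent of one another, each median split keeps about half of the patients, so requiring k conditions at once keeps roughly 0.5^k of them:

```r
n_patients <- 196
k <- 1:7

# Expected number of patients meeting all k median-split conditions,
# assuming the genes are independent (a simplification; real genes correlate)
expected <- n_patients * 0.5^k
data.frame(k = k, expected = round(expected, 1))
```

Under this assumption about 1.5 patients are expected for k = 7, and only combinations of one or two genes (98 and 49 patients) leave more than 30 patients in the 'high risk' group.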

So, my question is: would there be a way to test for the best combination of those 7 candidates, with a reasonable number of patients (i.e. >30)? I was thinking about some sort of permutation/combinatorial analysis coupled with Cox regression to find the combination of those targets that best associates with OS.

For example:

gene_1 + gene_9 + gene_18 + gene_26 OR

gene_1 + gene_18 + gene_27 OR

gene_9 + gene_18 + gene_26 + gene_27 OR

gene_18 + gene_27 ... and so forth.
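The kind of search listed above could be sketched as follows: enumerate every subset of the candidates with `combn()`, fit one Cox model per subset, and rank the models. This is a sketch only, run on simulated data with four hypothetical candidates; ranking by AIC is an assumption made here for illustration, not something established in this thread.

```r
library(survival)

set.seed(7)
n <- 196
# Simulated stand-in for `shortlist` with 4 hypothetical candidate genes
toy <- data.frame(
  OS       = rexp(n, rate = 0.02),   # follow-up time (made up)
  OS_codex = rbinom(n, 1, 0.7),      # 1 = event, 0 = censored (made up)
  gene_1   = rnorm(n, 9),
  gene_9   = rnorm(n, 6),
  gene_18  = rnorm(n, 6),
  gene_27  = rnorm(n, 7)
)
candidates <- c("gene_1", "gene_9", "gene_18", "gene_27")

# Every non-empty subset of the candidates: 2^4 - 1 = 15 combinations
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one Cox model per subset and record its AIC (lower is better)
results <- do.call(rbind, lapply(subsets, function(s) {
  f   <- as.formula(paste("Surv(OS, OS_codex) ~", paste(s, collapse = " + ")))
  fit <- coxph(f, data = toy)
  data.frame(genes = paste(s, collapse = " + "),
             n_genes = length(s),
             AIC = AIC(fit))
}))

# Best-ranked combinations first
head(results[order(results$AIC), ])
```

With 7 candidates there would be 2^7 - 1 = 127 subsets, which is still quick to fit. Picking the best of many models on the same data tends to give over-optimistic results, so the top-ranked combination would ideally be checked on independent samples.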

First of all, does that rationale make any sense? And if so, how could I accomplish that?

Any ideas? Thank you!

4.9 years ago by dodausp ▴ 180