Question: R survival analysis
hdy120 wrote:

I am learning how to do survival analysis in R using the Cox proportional hazards model, which is implemented by the `coxph` function in the `survival` package. When I looked at some online examples, e.g. here, I found that sometimes the code looks like

`Surv(time, censor) ~ factor(hercoc)`

But I can also do

`Surv(time, censor) ~ hercoc`

I was wondering what the difference is between these two and when I should use each one. Also, how should I interpret the result from the version that uses `factor`?

survival R • 11k views
modified 6.0 years ago by russhh5.5k • written 6.0 years ago by hdy120

In R, the factor data type should be used for categorical data. For example, if you were doing survival analysis for three different treatments coded as

`treatments <- c(1, 2, 3)`

then you should pass this vector as a factor, because the data are categorical. If you did not, R would treat the codes as a continuous variable and fit a single slope across them, which could lead you to misinterpret the results.

On the other hand, if the treatment was a single drug given at different concentrations, such as

`treatment <- c(0, 1, 1.5, 2)`

Then you should not factor these data because they are continuous.
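To make the difference concrete, here is a small sketch using base R's `model.matrix()`, which shows the columns a model formula generates. The `treatment_code` vector and the object names are made up for illustration; they are not from the question.

```r
# Hypothetical data: six patients, treatment coded 1, 2 or 3
treatment_code <- c(1, 2, 3, 1, 2, 3)

# Numeric coding: a single column, so a model fits one slope across the codes
x_numeric <- model.matrix(~ treatment_code)

# Factor coding: one indicator column per non-reference level (here 2 and 3)
x_factor <- model.matrix(~ factor(treatment_code))

colnames(x_numeric)
colnames(x_factor)
```

`coxph` does not estimate an intercept (it is absorbed into the baseline hazard), so the remaining columns are the ones that get a coefficient: one for the numeric coding, two for the factor coding.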

At least that's my understanding, others please chime in

russhh5.5k wrote:

If your variable `hercoc` has only two levels then there is no difference. However, if it has 3 or more levels then there is a difference. You haven't provided any example data, so I am assuming that `hercoc` is numeric.

Using a more concrete example:

`library(survival)`

`attach(lung)`

`head(lung)`

```
#   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
# 1    3  306      2  74   1       1       90       100     1175      NA
# 2    3  455      2  68   1       0       90        90     1225      15
# 3    3 1010      1  56   1       0       90        90       NA      15
# 4    5  210      2  57   1       1       90        60     1150      11
```

The variable `sex` has two levels but is coded as a numeric variable here (1 for male, 2 for female).

```
summary(as.factor(sex))
#   1   2
# 138  90
```

There's only one coefficient fitted by the model regardless of how you write it:

`coxph(Surv(time, status) ~ sex)`

```
Call:
coxph(formula = Surv(time, status) ~ sex)

      coef exp(coef) se(coef)     z      p
sex -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165
```

`coxph(Surv(time, status) ~ as.factor(sex))`

```
Call:
coxph(formula = Surv(time, status) ~ as.factor(sex))

                  coef exp(coef) se(coef)     z      p
as.factor(sex)2 -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165
```

For the variable `ph.ecog` there are 4 levels (plus one missing value):

```
summary(as.factor(ph.ecog))
#   0    1    2    3 NA's
#  63  113   50    1    1
```

When fitting the survival model against `ph.ecog`, it really does make a difference whether the variable enters as a numeric or as a factor. If it is treated as numeric, only a single coefficient is fitted (for a given individual, the value of `ph.ecog` is multiplied by this coefficient before entering the `coxph` calculation):

`coxph(Surv(time, status) ~ ph.ecog)`

```
Call:
coxph(formula = Surv(time, status) ~ ph.ecog)

         coef exp(coef) se(coef)   z       p
ph.ecog 0.476      1.61    0.113 4.2 2.7e-05

Likelihood ratio test=17.6  on 1 df, p=2.77e-05  n= 227, number of events= 164
   (1 observation deleted due to missingness)
```

However, when it is treated as a factor, three different coefficients are fitted, one for each non-reference level (i.e. levels 1, 2 and 3 each have a coefficient, with level 0 as the reference), and for a given individual you would look up the coefficient corresponding to their level of the `ph.ecog` factor.

`coxph(Surv(time, status) ~ as.factor(ph.ecog))`

```
Call:
coxph(formula = Surv(time, status) ~ as.factor(ph.ecog))

                     coef exp(coef) se(coef)    z       p
as.factor(ph.ecog)1 0.369      1.45    0.199 1.86 6.3e-02
as.factor(ph.ecog)2 0.916      2.50    0.225 4.08 4.5e-05
as.factor(ph.ecog)3 2.208      9.10    1.026 2.15 3.1e-02

Likelihood ratio test=18.4  on 3 df, p=0.000356  n= 227, number of events= 164
   (1 observation deleted due to missingness)
```
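One way to see what the two codings imply is to put the log hazard ratios (relative to level 0) side by side. This is only a sketch, assuming the `lung` data that ships with `survival`; the object names `fit_num`, `fit_fac` and `implied` are my own, and I pass `data = lung` instead of relying on `attach()`.

```r
library(survival)

# Numeric coding: one coefficient, the log hazard ratio per unit of ph.ecog
fit_num <- coxph(Surv(time, status) ~ ph.ecog, data = lung)

# Factor coding: one coefficient per non-reference level (1, 2, 3 vs 0)
fit_fac <- coxph(Surv(time, status) ~ as.factor(ph.ecog), data = lung)

# Log hazard ratio versus level 0 implied by each model
implied <- data.frame(
  level   = 1:3,
  numeric = unname(coef(fit_num)) * 1:3,  # forced to be linear in the score
  factor  = unname(coef(fit_fac))         # free to differ at each level
)
print(implied)

# The numeric model is a special case of the factor model, so a
# likelihood ratio test on 2 df checks whether the linear trend is adequate
print(anova(fit_num, fit_fac))
```

The numeric model constrains the effect to increase by the same amount for each step in `ph.ecog`, whereas the factor model estimates each level separately. If the two columns of `implied` are close and the test is not significant, the simpler numeric coding may be a reasonable summary.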

Look into how the coefficients enter the survival model in a good generalised linear model book (I really can't explain that quickly for you).

I also found the `strata` function in survival analysis. What is the difference between `strata` and `factor`?

No idea, I'm afraid. That might be better asked as a separate question.