R survival analysis
Entering edit mode
8.6 years ago
hdy ▴ 160

I am now learning how to do survival analysis in R and using COX proportional hazards model, which can be referred to the 'coxph' function under package 'survival'. When I looked some online examples, eg, here, I found sometimes the code looks like

Surv(time, censor) ~ factor(hercoc)

But I can also do

Surv(time, censor) ~ hercoc

I was wondering what is the difference between these two and when shall I use which one. And how should I interpret the result from the one used 'factor'

R survival • 14k views
Entering edit mode

In R the factor data format should be used for categorical data. For example, if you were doing survival analysis for three different treatments

treatments<- c(1,2,3)

Then you should pass this vector as a factor because the data are categorical. If you did not do this then R would assume the data are continuous and might cause misinterpretations of the results.

On the other hand, if the treatment was of one drug but at different concentrations such as

treatment<- c(0,1,1.5,2)

Then you should not factor these data because they are continuous.

At least that's my understanding, others please chime in

Entering edit mode
8.6 years ago
russhh 5.7k

If your variable hercoc has only two levels then there is no difference. However, if it has 3 or more levels then there is a difference. You haven't provided any example data, and I am assuming that hercoc is numeric.

Using a more concrete example:




#     inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
#1      3  306      2  74   1       1       90       100     1175      NA
#2      3  455      2  68   1       0       90        90     1225      15
#3      3 1010      1  56   1       0       90        90       NA      15
#4      5  210      2  57   1       1       90        60     1150      11

The variable sex has two levels but is coded as a numeric variable here (1 for male, 2 for female or vice versa).

  1   2 
138  90

There's only one coefficient fitted by the model regardless of how you write it:

coxph(Surv(time, status) ~ sex)

# Call:
coxph(formula = Surv(time, status) ~ sex)

      coef exp(coef) se(coef)     z      p
sex -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165 

coxph(Surv(time, status) ~ as.factor(sex))
coxph(formula = Surv(time, status) ~ as.factor(sex))

                  coef exp(coef) se(coef)     z      p
as.factor(sex)2 -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165

For the variable ph.ecog there are 4 levels

#   0    1    2    3 NA's 
#  63  113   50    1    1

On fitting the survival model against ph.ecog it really does make a difference whether the variable enters as a numeric or a factor. If treated numerically, only a single coefficient is fitted (for a given individual, the value for ecog is multiplied by this coefficient before entering into the coxph calculation);

coxph(Surv(time, status) ~ ph.ecog)
coxph(formula = Surv(time, status) ~ ph.ecog)

         coef exp(coef) se(coef)   z       p
ph.ecog 0.476      1.61    0.113 4.2 2.7e-05

Likelihood ratio test=17.6  on 1 df, p=2.77e-05  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

However, treated as a factor, three different coefficients will be fitted, one for each non-reference level (ie, levels 1 2 and 3 each have a coefficient) and for a given individual you would look up the coefficient corresponding to the level of the ecog factor.

> coxph(Surv(time, status) ~ as.factor(ph.ecog))
coxph(formula = Surv(time, status) ~ as.factor(ph.ecog))

                     coef exp(coef) se(coef)    z       p
as.factor(ph.ecog)1 0.369      1.45    0.199 1.86 6.3e-02
as.factor(ph.ecog)2 0.916      2.50    0.225 4.08 4.5e-05
as.factor(ph.ecog)3 2.208      9.10    1.026 2.15 3.1e-02

Likelihood ratio test=18.4  on 3 df, p=0.000356  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

Look into how the coefficients enter the survival model in a good Generalised linear model book (I really can't explain that quickly for you)

Entering edit mode

I also found the "strata" function in survival analysis. What is the different between this "strata" and "factor"?

Entering edit mode

no idea I'm afraid. Might be better as another question

Entering edit mode

For strata, it means stratification. If you stratify on ph.ecog, you will have no coef estimation on that variable, but that variable has been adjusted in the model. I think you'd better check the distribution on that variable to see how many samples in each stratum of that variable first.


Login before adding your answer.

Traffic: 1502 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6