Question: R survival analysis
gravatar for hdy
3.4 years ago by
United States
hdy80 wrote:

I am now learning how to do survival analysis in R and using COX proportional hazards model, which can be referred to the 'coxph' function under package 'survival'. When I looked some online examples, eg, here, I found sometimes the code looks like 

Surv(time, censor) ~ factor(hercoc)

But I can also do 

Surv(time, censor) ~ hercoc


I was wondering what is the difference between these two and when shall I use which one. And how should I interpret the result from the one used 'factor'

survival R • 7.1k views
ADD COMMENTlink modified 3.4 years ago by russhh2.8k • written 3.4 years ago by hdy80

In R the factor data format should be used for categorical data. For example, if you were doing survival analysis for three different treatments 

treatments<- c(1,2,3) 

Then you should pass this vector as a factor because the data are categorical. If you did not do this then R would assume the data are continuous and might cause misinterpretations of the results.

On the other hand, if the treatment was of one drug but at different concentrations such as 

treatment<- c(0,1,1.5,2) 

Then you should not factor these data because they are continuous. 

At least that's my understanding, others please chime in 

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by QVINTVS_FABIVS_MAXIMVS2.0k
gravatar for russhh
3.4 years ago by
UK, U. Glasgow
russhh2.8k wrote:

If your variable 'hercoc' has only two levels then there is no difference. However, if it has 3 or more levels then there is a difference. You hvaen't provided any exaple data, and I am assuming that hercoc is numeric. 


Using a more concrete example:




#     inst time status age sex ph.ecog ph.karno pat.karno wt.loss
#1      3  306      2  74   1       1       90       100     1175      NA
#2      3  455      2  68   1       0       90        90     1225      15
#3      3 1010      1  56   1       0       90        90       NA      15
#4      5  210      2  57   1       1       90        60     1150      11

The variable `sex` has two levels but is coded as a numeric variable here (1 for male, 2 for female or vice versa). 

  1   2 
138  90

There's only one coefficient fitted by the model regardless of how you write it:

coxph(Surv(time, status) ~ sex)

coxph(formula = Surv(time, status) ~ sex)

      coef exp(coef) se(coef)     z      p
sex -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165 
coxph(Surv(time, status) ~ as.factor(sex))
coxph(formula = Surv(time, status) ~ as.factor(sex))

                  coef exp(coef) se(coef)     z      p
as.factor(sex)2 -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165

For the variable ph.ecog there are 4 levels

#   0    1    2    3 NA's 
#  63  113   50    1    1 

On fitting the survival model against ph.ecog it really does make a difference whetehr the variable enters as a numeric or a factor. If treated numerically, only a single coefficient is fitted (for a given individual, the value for ecog is multiplied by this coefficient before entering into the coxph calculation); 

coxph(Surv(time, status) ~ ph.ecog)
coxph(formula = Surv(time, status) ~ ph.ecog)

         coef exp(coef) se(coef)   z       p
ph.ecog 0.476      1.61    0.113 4.2 2.7e-05

Likelihood ratio test=17.6  on 1 df, p=2.77e-05  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

Howver, treated as a factor, three different coefficients will be fitted, one for each non-reference level (ie, levels 1  2 and 3 each have a coefficient) and for a given indiviudal you would look up the coefficient corresponding to the level of the ecog factor.

> coxph(Surv(time, status) ~ as.factor(ph.ecog))
coxph(formula = Surv(time, status) ~ as.factor(ph.ecog))

                     coef exp(coef) se(coef)    z       p
as.factor(ph.ecog)1 0.369      1.45    0.199 1.86 6.3e-02
as.factor(ph.ecog)2 0.916      2.50    0.225 4.08 4.5e-05
as.factor(ph.ecog)3 2.208      9.10    1.026 2.15 3.1e-02

Likelihood ratio test=18.4  on 3 df, p=0.000356  n= 227, number of events= 164 
   (1 observation deleted due to missingness)


Look into how the coefficients enter the survival model in a good Generalised linear model book (I really can't explain that quickly for you)


ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by russhh2.8k

I also found the "strata" function in survival analysis. What is the different between this "strata" and "factor"?

ADD REPLYlink written 3.4 years ago by hdy80

no idea I'm afraid. Might be better as another question

ADD REPLYlink written 3.4 years ago by russhh2.8k

For strata, it means stratification. If you stratify on ph.ecog, you will have no coef estimation on that variable, but that variable has been adjusted in the model. I think you'd better check the distribution on that variable to see how many samples in each stratum of that variable first.

ADD REPLYlink written 2.7 years ago by xrao10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1278 users visited in the last hour