If your variable `hercoc` has only two levels, then there is no difference. However, if it has three or more levels, then there is a difference. You haven't provided any example data, so I am assuming that `hercoc` is numeric.

Using a more concrete example:

`library(survival)`

`attach(lung)`

`head(lung)`

`# inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss`

`#1 3 306 2 74 1 1 90 100 1175 NA`

`#2 3 455 2 68 1 0 90 90 1225 15`

`#3 3 1010 1 56 1 0 90 90 NA 15`

`#4 5 210 2 57 1 1 90 60 1150 11`

The variable `sex` has two levels but is coded as a numeric variable here (in the `lung` data, 1 is male and 2 is female).

`summary(as.factor(sex))`

`1 2`

`138 90`

There's only one coefficient fitted by the model regardless of how you write it:

`coxph(Surv(time, status) ~ sex)`

`Call:`

`coxph(formula = Surv(time, status) ~ sex)`

` coef exp(coef) se(coef) z p`

`sex -0.531 0.588 0.167 -3.18 0.0015`

`Likelihood ratio test=10.6 on 1 df, p=0.00111 n= 228, number of events= 165 `

`coxph(Surv(time, status) ~ as.factor(sex))`

`Call:`

`coxph(formula = Surv(time, status) ~ as.factor(sex))`

` coef exp(coef) se(coef) z p`

`as.factor(sex)2 -0.531 0.588 0.167 -3.18 0.0015`

`Likelihood ratio test=10.6 on 1 df, p=0.00111 n= 228, number of events= 165`
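The equivalence for a two-level variable can also be checked directly rather than by eye. This is a minimal sketch, assuming the `lung` data that ships with the `survival` package and passing it through `data =` instead of `attach()`; the object names `fit_num` and `fit_fac` are only illustrative.

```r
library(survival)

# Same model twice: sex as a number (1/2) and sex as a two-level factor
fit_num <- coxph(Surv(time, status) ~ sex, data = lung)
fit_fac <- coxph(Surv(time, status) ~ as.factor(sex), data = lung)

# Strip the coefficient names ("sex" vs "as.factor(sex)2") and compare the values
same_coef <- all.equal(unname(coef(fit_num)), unname(coef(fit_fac)),
                       tolerance = 1e-6)
same_coef
```

With only two levels, a one-unit step in the numeric coding is the same contrast as the single non-reference factor level, which is why only the coefficient label changes.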

The variable `ph.ecog`, on the other hand, has 4 levels:

`summary(as.factor(ph.ecog))`

`# 0 1 2 3 NA's`

`# 63 113 50 1 1`

When fitting the survival model against `ph.ecog`, it really does make a difference whether the variable enters as a numeric or as a factor. If treated numerically, only a single coefficient is fitted (for a given individual, the value of `ph.ecog` is multiplied by this coefficient before entering the `coxph` calculation):

`coxph(Surv(time, status) ~ ph.ecog)`

`Call:`

`coxph(formula = Surv(time, status) ~ ph.ecog)`

` coef exp(coef) se(coef) z p`

`ph.ecog 0.476 1.61 0.113 4.2 2.7e-05`

`Likelihood ratio test=17.6 on 1 df, p=2.77e-05 n= 227, number of events= 164 `

(1 observation deleted due to missingness)
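To make the "multiplied by this coefficient" point concrete, here is a small sketch of the hazard ratios implied by the numeric coding. It assumes the same `lung` data from the `survival` package; `beta` and `hr_vs_0` are names made up for this illustration.

```r
library(survival)

fit_num <- coxph(Surv(time, status) ~ ph.ecog, data = lung)
beta <- unname(coef(fit_num))

# With numeric coding, the hazard ratio of score k versus score 0 is exp(k * beta),
# so every one-unit step in ph.ecog multiplies the hazard by the same factor
hr_vs_0 <- exp(beta * 0:3)
names(hr_vs_0) <- paste0("ecog", 0:3)
hr_vs_0
```

In other words, the numeric model forces the step from 0 to 1 to have the same effect as the step from 1 to 2 or from 2 to 3.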

However, when it is treated as a factor, three different coefficients are fitted, one for each non-reference level (i.e., levels 1, 2 and 3 each have a coefficient, with level 0 as the reference). For a given individual, you would look up the coefficient corresponding to that individual's level of the `ph.ecog` factor.

`coxph(Surv(time, status) ~ as.factor(ph.ecog))`

`Call:`

`coxph(formula = Surv(time, status) ~ as.factor(ph.ecog))`

` coef exp(coef) se(coef) z p`

`as.factor(ph.ecog)1 0.369 1.45 0.199 1.86 6.3e-02`

`as.factor(ph.ecog)2 0.916 2.50 0.225 4.08 4.5e-05`

`as.factor(ph.ecog)3 2.208 9.10 1.026 2.15 3.1e-02`

`Likelihood ratio test=18.4 on 3 df, p=0.000356 n= 227, number of events= 164 `

(1 observation deleted due to missingness)
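One possible way to judge whether the extra coefficients are worth having is a likelihood ratio test between the two fits, since the single-slope model is a special case of the factor model. This is only a sketch, again assuming the `lung` data from `survival`; `fit_num`, `fit_fac` and `lrt` are illustrative names.

```r
library(survival)

# Numeric coding: one slope. Factor coding: one coefficient per non-reference level.
fit_num <- coxph(Surv(time, status) ~ ph.ecog, data = lung)
fit_fac <- coxph(Surv(time, status) ~ as.factor(ph.ecog), data = lung)

# Both fits drop the same observation with missing ph.ecog, so they use the same rows.
# anova() on two nested coxph fits gives a likelihood ratio test.
lrt <- anova(fit_num, fit_fac)
lrt
```

The test compares the two extra parameters of the factor model against the linear-trend assumption; which coding is appropriate still depends on what the variable means.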

For how the coefficients enter the survival model, look in a good book on generalised linear models (I really can't explain that quickly here).

In R, the factor data type should be used for categorical data. For example, if you were doing survival analysis for three different treatments, you should pass the treatment vector as a factor because the data are categorical. If you did not do this (and the treatments were coded as numbers), R would assume the data are continuous, which could lead to misinterpretation of the results.

On the other hand, if the treatment was a single drug at different concentrations, you should not convert these data to a factor because they are continuous.
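The distinction can be sketched with made-up vectors. Everything here is hypothetical and invented for illustration (the treatment codes, the concentrations and the names `treatment`, `dose`, `X_cat`, `X_num`); `model.matrix()` is used only to show how many columns each coding produces. `coxph` itself fits no intercept, but the dummy-column logic is the same.

```r
# Hypothetical data, invented for illustration only
treatment <- c(1, 2, 3, 1, 2, 3)      # arbitrary numeric codes for three different treatments
dose      <- c(5, 10, 20, 5, 10, 20)  # concentrations of a single drug (units assumed)

# Categorical: wrapping in factor() gives one dummy column per non-reference level
X_cat <- model.matrix(~ factor(treatment))

# Continuous: left numeric, the variable gets a single column and so a single slope
X_num <- model.matrix(~ dose)

ncol(X_cat)  # intercept plus two dummy columns
ncol(X_num)  # intercept plus one slope column
```

Leaving `treatment` numeric would fit a single slope across the codes 1, 2, 3, which is only meaningful if the codes have a real ordering and spacing.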

At least that's my understanding, others please chime in
