Question: how to perform survival analysis on a corrected and structured data
0
18 days ago by
Learner 220
Learner 220 wrote:

Hello,

I posted a question but seems like was very bad and not clear at all so I did not get any help. I worked on my question and made it very clear . I have studied and structured the data better and it is with 3 columns first column show 2 category (Drug means drug was added and the patient was monitored ) and WT (means wild type) The second column shows the number of dead patient. The third is the hour that it happened

``````df<- structure(list(Condition = structure(c(1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L), .Label = c("Drug", "WT"), class = "factor"), Number.of.Dead = c(18L,
29L, 21L, 28L, 11L, 23L, 12L, 20L, 10L, 18L, 9L, 16L, 9L, 15L,
8L, 14L, 7L, 13L, 6L, 12L, 3L, 12L, 2L, 10L), Time = c(1L, 1L,
2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L,
10L, 10L, 11L, 11L, 12L, 12L)), row.names = c(NA, -24L), class = "data.frame")
``````

Now I am doing the following but the output is a bit strange, what am I missing here?

``````require(survcomp)
require(survival)

x.label="Time (Hours)", y.label="Overall survival", main.title="",
leg.text=c("Drug", "WT"), leg.pos="topright", leg.bty="n", leg.inset=0,
.col=c("forestgreen","red3"),
xlim=c(0,40),
o.text="",
.lty=c(1,1), .lwd=c(1.75,1.75),
show.n.risk=TRUE, n.risk.step=10, n.risk.cex=0.8, verbose=FALSE)
``````

of course the status is all 1 (means the dead)

R • 124 views
modified 18 days ago by Kevin Blighe59k • written 18 days ago by Learner 220
2
18 days ago by
Kevin Blighe59k
Kevin Blighe59k wrote:

Hey again, what I was implying in the [now deleted] other thread was that your data format is not suitable for the functions that you are using. Your data looks like this:

``````   Condition Number.of.Dead Time
1       Drug             18    1
2         WT             29    1
3       Drug             21    2
4         WT             28    2
5       Drug             11    3
6         WT             23    3
7       Drug             12    4
8         WT             20    4
9       Drug             10    5
10        WT             18    5
11      Drug              9    6
12        WT             16    6
13      Drug              9    7
14        WT             15    7
15      Drug              8    8
16        WT             14    8
17      Drug              7    9
18        WT             13    9
19      Drug              6   10
20        WT             12   10
21      Drug              3   11
22        WT             12   11
23      Drug              2   12
24        WT             10   12
``````

The problematic column is `Number.of.Dead`. We typically have a single row for every 'event' that's recorded (event == 1 death, in your study). So, technically, that first row representing 18 deaths should be expanded into at least 18 rows, representing each patient that has died.

Your dataset neither says anything about the patients who are still alive because there is no information about the total cohort size. So, a survival analysis in the form that you would like cannot be conducted.

Death would typically be encoded as `0` or `1`, i.e., a binary classifier.

The 'event' parameter that is passed to `Surv()` has the following manual entry:

event: The status indicator, normally 0=alive, 1=dead. Other choices are ‘TRUE’/‘FALSE’ (‘TRUE’ = death) or 1/2 (2=death). For interval censored data, the status indicator is 0=right censored, 1=event at ‘time’, 2=left censored, 3=interval censored. For multiple enpoint data the event variable will be a factor, whose first level is treated as censoring. Although unusual, the event indicator can be omitted, in which case all subjects are assumed to have an event.

## ----------------------

Another major point: your formula should reversed (but neither this will solve the situation). The syntax of the `Surv()` function is:

``````Surv(time, event)
``````

So, in the example you have provided, `time` is being regarded as `Number.of.Dead` - check your x-axis in the plot that's produced from your code.

Take a look at the example here, where you will see how the data should be structured: A: survfit(Surv()) P-value interpretation for 3 survival curves?

Kevin

@Kevin Blighe Thanks a lot Kevin for your very nice response, If I get the info for the alive patients in each hour and expand the dead ones , would it solve the issue?

1

It may help - yes. Can you get that information? Please be sure about the information that you need, though. You could take a look at other examples online, including the example that I showed.

For survival, we want to know those who died and those who never died within the time-frame of the study. Each individual should have a single record, i.e., a single row per patient in the data, with death encoded as `0`|`1`.

If you take a look at the data in my other thread, I will provide a few explanations:

``````   time status               TreatmentStrategy
1     9      1      Maintained <<- patient died in the 'Maintained' group at 9 days
2    13      1      Maintained
3    13      0      Maintained <<- patient did not die, but left that study for some reason at 13 days (patient is 'censored')
4    18      1      Maintained
5    23      1      Maintained
6    28      0      Maintained
7    31      1      Maintained
8    34      1      Maintained
9    45      0      Maintained <<- this patient is the last known to be in the study in the 'Maintained' group. They left the study at 45 days. You will see this reflected in the plot (below)
10   48      1 SuperMaintained <<- patient died in 'SuperMaintained' group at 48 days
11  161      0 SuperMaintained <<- last day recorded for the entire study, so, the plot x-axis only goes up to 161 days (see plot). This patient survived
12    5      1   Nonmaintained
13    5      1   Nonmaintained
14    8      1   Nonmaintained
15    8      1   Nonmaintained
16   12      1   Nonmaintained
17   16      0   Nonmaintained
18   23      1   Nonmaintained
19   27      1   Nonmaintained
20   30      1   Nonmaintained
21   33      1   Nonmaintained
22   43      1 SuperMaintained
23   45      1 SuperMaintained
``````

Basically, if you look through the data and compare it to the plot, you can see that:

• all, except 1, Nonmaintained patients died
• >75% of Maintained patients died
• ~75% of SuperMaintained patients died, but, of those who were alive, it can be loosely inferred that they made it to the final day of the study.

Not that survival says nothing about the days after the final day of the study. For all intents and purposes, all patients may have died at day 162, a day for which we have no data. This is why it's important to not manipulate survival data in order to make, e.g., a treatment / drug look favourable.

You could show this to your collaborator so that you get the correct data.

The data that you have can still be used, by the way - it could be a table in a manuscript.

@Kevin Blighe

Hi Kevin,

I removed those patients with no tag , I removed the hours that they didn't follow up. Now I cleaned the data and it looks like the following, what do you think?

``````df<-structure(list(Sample = c("WT", "WT", "WT", "WT", "WT", "WT",
"WT", "WT", "WT", "WT", "Drug", "Drug", "Drug", "Drug", "Drug",
"Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "WT",
"WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", "Drug",
"Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug",
"Drug", "Drug", "Drug", "WT", "WT", "WT", "WT", "WT", "WT", "WT",
"WT", "WT", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug",
"Drug", "Drug", "Drug", "Drug", "Drug", "WT", "WT", "WT", "WT",
"WT", "WT", "WT", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug",
"Drug", "Drug", "Drug", "Drug", "Drug", "WT", "WT", "WT", "WT",
"Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug", "Drug",
"Drug", "Drug", "WT", "WT", "WT", "Drug", "Drug", "Drug", "Drug",
"Drug", "Drug", "Drug", "Drug", "Drug"), Patints = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Hour = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L), Condition = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-109L))
``````

1 means it is still alive and 0 means it is died

1

That looks a lot better, and more favourable for the drug.

Just one thing: the `Condition` should be encoded the other way:

• 1, death
• 0, alive (or patient left the study at a certain hour and we don't know if they are alive or dead, i.e., censored)

.

``````# flip condition
df\$Condition <- ifelse(df\$Condition == 1, 0, 1)

# convert sample to categorical and set order
df\$Sample <- factor(df\$Sample, levels = c('WT', 'Drug'))

require(survcomp)
require(survival)

km.coxph.plot(formula.s = Surv(Hour, Condition) ~ Sample, data.s = df, mark.time = TRUE,
x.label = 'Time (Hours)', y.label = 'Overall survival', main.title = '',
leg.text = c('WT', 'Drug'),
leg.pos = 'left', leg.bty = 'n', leg.inset = 0,
.col = c('forestgreen', 'red3'),
#xlim = c(0, 40),
o.text = '',
.lty = c(1,1), .lwd = c(1.75, 1.75),
show.n.risk = TRUE, n.risk.step = 2, n.risk.cex = 0.8, verbose = FALSE)
``````

Be very careful with the `leg.text` parameter - it should be in the same order as per the categorical variable, 'Sample', i.e., order of factors in `df\$Sample` should be the exact same as order of strata listed in `leg.text`

Some simple stats:

``````# flip the order of `Sample` back the other way:
df\$Sample <- factor(df\$Sample, levels = c('Drug', 'WT'))

# perform Cox proportional hazards analysis
summary(coxph(Surv(Hour, Condition) ~ Sample, data = df))

Call:
coxph(formula = Surv(Hour, Condition) ~ Sample, data = df)

n= 109, number of events= 13

coef exp(coef) se(coef)     z Pr(>|z|)
SampleWT 1.7323    5.6537   0.6071 2.853  0.00432 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
SampleWT     5.654     0.1769      1.72     18.58

Concordance= 0.715  (se = 0.068 )
Likelihood ratio test= 9.06  on 1 df,   p=0.003
Wald test            = 8.14  on 1 df,   p=0.004
Score (logrank) test = 10.25  on 1 df,   p=0.001
``````

So, we can say:

• Log rank p-value is p=0.001
• Hazard ratio for dying while not on drug is 5.654 (95% CI: 1.72, 18.58)

Great work

Kevin

1

@Kevin Blighe

Thanks a bunch, you put some time to guide me and I need to do some home work to digest the command you gave. I really appreciate your answer. I will let you know as soon as I get the point across. I liked all your answer and accepted it

Thanks again Kevin

@Kevin Blighe Hi, I posted a new question regarding drug_1 versus WT and drug_2 versus WT. Can you please have a look and give me some idea how to perform the statistic on that? thanks a lot

@Kevin Blighe

I like this show.n.risk = TRUE however, is it possible to make a distance between the lable and the values? I looked at info but I could not find anything. it seems just playing with n.risk.cex but when the data gets big it is a mess. any idea ? another problem is with sample size, in one condition is different than the other. so do you know a better test that I could perform in order to get a more robust result?