I'm working with RNAseq TCGA data from a group of 85 cancer patient. I found the differentially expressed genes(604 genes) between cancer e normal tissue and on this set of gene I want to identify prognostic genes by survival analysis.
I used univariate cox proportional analysis(survival R package, coxph function), and I found a hundred genes related to survival. Now I want to evaluate the likelihood-based modeling with data partitions. I want to split my patient in training set and test set. To do so I use this code, R package rbsurv:
x <- read.csv2("expr_data.txt", sep = "\t", header = TRUE)
y <- read.csv2("survival_data.txt",sep = "\t", header = TRUE)
head(x)
ensembl_ID tumor01 tumor02 tumor03 tumor04 tumor05 tumor06 tumor07 tumor08
1 ENSG00000007372.19 8.097539 6.691016 7.330965 7.751020 8.622778 7.548625 8.066513 6.900943
2 ENSG00000007908.14 6.483139 4.523530 8.575936 6.763546 6.381578 10.143505 4.593375 8.982103
3 ENSG00000008517.15 11.710044 10.311145 10.460052 9.498862 9.997961 12.285310 12.477057 10.392469
4 ENSG00000026025.12 16.110803 15.336699 15.003814 14.958670 15.175336 16.419507 15.487635 16.717384
5 ENSG00000038427.14 14.044901 13.193529 13.536949 11.301741 13.086164 14.876246 13.625205 15.622001
6 ENSG00000049323.14 10.488359 11.398026 9.515488 10.274103 10.727591 12.997113 11.298661 11.474576
head(y)
sample time status clusters_n primary_diagnosis age_at_index
1 tumor01 385 1 3 1 69
2 tumor02 719 1 3 4 65
3 tumor03 228 0 3 1 51
4 tumor04 1309 0 3 3 68
5 tumor05 76 1 2 3 73
6 tumor06 142 1 2 3 80
## Risk factors
z <- cbind(y$age_at_index, y$primary_diagnosis)
colnames(z) <- c("Age","Istology")
##prepare file
id <- x[,1]
x <- x[,-1]
dim(x) # 133 probe sets and 85 sample
dim(y)
## Partition data into train and test sets
n <- ncol(x)
k <- rep((1:2), 50)[1:n]
set.seed(123)
i <- sample(k, size=n, replace = FALSE)
k.test <-(1:n)[i==1]
x.train <- x[,(-k.test)]
t.train <- y$time[(-k.test)]
s.train <- y$status[(-k.test)]
z.train <- as.matrix(z[(-k.test),])
x.test <- x[,(k.test)]
t.test <- y$time[(k.test)]
s.test <- y$status[(k.test)]
z.test <- as.matrix(z[(k.test),])
## Robust modeling with risk factors
fit <- rbsurv(time=t.train, status=s.train, x=x.train, z=z.train, alpha=0.05, method="efron", max.n.genes=100,
n.iter=20, n.fold=3, n.seq=4)
fit$model
write.csv2(fit$model, "fit_model.txt")
the output looks like this:
Seq Order Gene nloglik AIC Selected
0 1 0 0 106,2 212,4
110 1 1 63 95,92 193,84 *
2 1 2 97 91,46 186,92 *
3 1 3 41 88,26 182,52 *
4 1 4 58 85,73 179,46 *
5 1 5 126 83,96 177,92 *
6 1 6 36 81,31 174,61 *
7 1 7 105 79,37 172,73 *
8 1 8 66 77,72 171,43 *
9 1 9 9 77,16 172,32 *
10 1 10 89 74,93 169,87 *
11 1 11 52 70,8 163,59 *
12 1 12 56 69,32 162,63 *
13 1 13 40 69,12 164,25
14 1 14 32 68,81 165,62
15 1 15 85 67,92 165,84
16 1 16 106 67,85 167,71
17 1 17 3 67,04 168,09
18 1 18 17 66,89 169,78
19 1 19 4 66,89 171,77
20 1 20 5 64,71 169,42
21 1 21 6 64,57 171,13
22 1 22 7 63,32 170,64
1 2 0 0 106,2 212,4
111 2 1 69 98,78 199,55 *
23 2 2 94 95,05 194,11 *
31 2 3 49 93,09 192,18 *
41 2 4 102 90,7 189,41 *
51 2 5 111 87,55 185,09 *
61 2 6 68 85,61 183,22 *
71 2 7 83 84,82 183,64 *
81 2 8 110 82,03 180,05 *
91 2 9 18 81,18 180,37 *
101 2 10 131 80,49 180,98 *
112 2 11 132 77,25 176,5 *
121 2 12 108 75,38 174,76 *
131 2 13 47 74,55 175,11
141 2 14 37 73,96 175,93
151 2 15 30 72,68 175,35
161 2 16 19 72,31 176,63
171 2 17 81 72,21 178,42
181 2 18 8 71,8 179,6
191 2 19 5 71,64 181,28
201 2 20 15 71,6 183,21
211 2 21 3 70,87 183,74
221 2 22 4 70,81 185,63
2 3 0 0 106,2 212,4
113 3 1 46 99,89 201,77 *
24 3 2 38 94,57 193,14 *
32 3 3 130 93,7 193,41 *
42 3 4 17 93,09 194,19 *
52 3 5 129 92,62 195,25 *
62 3 6 81 92,14 196,27 *
72 3 7 78 88,81 191,62 *
82 3 8 106 87,07 190,15 *
92 3 9 99 86,05 190,09 *
102 3 10 21 85,44 190,88
114 3 11 95 85,37 192,75
122 3 12 25 84,41 192,82
132 3 13 37 83,26 192,53
142 3 14 16 82,71 193,42
152 3 15 98 82,6 195,19
162 3 16 61 81 194
3 4 0 0 106,2 212,4
115 4 1 48 97,91 197,81 *
25 4 2 85 95,43 194,86 *
33 4 3 54 94,78 195,56 *
43 4 4 55 93,14 194,28 *
53 4 5 61 92,23 194,46 *
63 4 6 42 90,77 193,54 *
73 4 7 79 89,46 192,93 *
83 4 8 8 88,12 192,24 *
93 4 9 51 87,88 193,75 *
103 4 10 21 86,74 193,48 *
116 4 11 53 86,5 195 *
123 4 12 125 86,46 196,92 *
133 4 13 123 85,49 196,97 *
143 4 14 7 84,3 196,6 *
153 4 15 16 83,72 197,44 *
163 4 16 67 78,9 189,8 *
172 4 17 34 78,78 191,57 *
182 4 18 133 76,76 189,52 *
192 4 19 101 74,88 187,76 *
202 4 20 3 74,76 189,52
Now my questions are:
- I ran my analysis on the training set, how can I validate the result on my test set? Should I run again the analysis on the test data, and then confront the two lists, and see how many genes are selected in both the lists?
- to achieve my goal of finding prognostic genes, is it correct to analyze the expression data first with univariate cox proportional hazard and then with a robust likelihood-based modeling with data partitions? In the last step I would like to analyze the likelihood analysis selected genes through multivariate cox proportional hazard. are all these steps correct and useful?
Are there a better ways or R packages to achieve my goal?
I'm really confused about all this process.