Question: Robust likelihood-based modeling with data partitions
luisa-bisceglia0 wrote, 3 days ago:

I'm working with RNAseq TCGA data from a group of 85 cancer patients. I found the differentially expressed genes (604 genes) between cancer and normal tissue, and on this set of genes I want to identify prognostic genes by survival analysis.

I used univariate Cox proportional hazards analysis (survival R package, coxph function) and found about a hundred genes related to survival; a rough sketch of that step is shown just below. Now I want to apply robust likelihood-based modeling with data partitions, splitting my patients into a training set and a test set, using the R package rbsurv.
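Roughly, the univariate screen looked like this (a sketch rather than my exact script: expr and clin are placeholder names for the expression matrix with genes in rows and the clinical table with time and status columns, and the p < 0.05 cutoff is only an example threshold):

library(survival)

## one Cox model per gene, with the gene's expression as the only covariate
univ_pvals <- apply(expr, 1, function(g) {
  fit <- coxph(Surv(clin$time, clin$status) ~ g)
  summary(fit)$coefficients[1, "Pr(>|z|)"]   # Wald p-value for the gene
})

## genes taken forward to the next step
surv_genes <- names(univ_pvals)[univ_pvals < 0.05]

The rbsurv part, with the code I actually used, follows: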

## packages used below
library(survival)
library(rbsurv)

## expression matrix (genes in rows, samples in columns) and survival/clinical table
## note: read.csv2() assumes "," as the decimal separator; if these files use ".",
## read.delim() may be the safer choice
x <- read.csv2("expr_data.txt", sep = "\t", header = TRUE)
y <- read.csv2("survival_data.txt", sep = "\t", header = TRUE)



head(x)
           ensembl_ID   tumor01   tumor02   tumor03   tumor04   tumor05   tumor06   tumor07   tumor08    
1  ENSG00000007372.19  8.097539  6.691016  7.330965  7.751020  8.622778  7.548625  8.066513  6.900943   
2  ENSG00000007908.14  6.483139  4.523530  8.575936  6.763546  6.381578 10.143505  4.593375  8.982103  
3  ENSG00000008517.15 11.710044 10.311145 10.460052  9.498862  9.997961 12.285310 12.477057 10.392469 
4  ENSG00000026025.12 16.110803 15.336699 15.003814 14.958670 15.175336 16.419507 15.487635 16.717384 
5  ENSG00000038427.14 14.044901 13.193529 13.536949 11.301741 13.086164 14.876246 13.625205 15.622001 
6  ENSG00000049323.14 10.488359 11.398026  9.515488 10.274103 10.727591 12.997113 11.298661 11.474576  

head(y)
sample     time status clusters_n primary_diagnosis age_at_index
1 tumor01  385      1          3                 1       69
2 tumor02  719      1          3                 4       65
3 tumor03  228      0          3                 1       51
4 tumor04 1309      0          3                 3       68  
5 tumor05   76      1          2                 3       73    
6 tumor06  142      1          2                 3       80

## Risk factors (clinical covariates passed to rbsurv): age and primary diagnosis
z <- cbind(y$age_at_index, y$primary_diagnosis)
colnames(z) <- c("Age", "Histology")

## Prepare the expression matrix: keep the Ensembl IDs aside so the numeric
## gene indices reported by rbsurv can be mapped back to them later
id <- x[, 1]
x  <- x[, -1]
dim(x)  # 133 genes and 85 samples
dim(y)
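One thing this code implicitly assumes is that the columns of x are in the same sample order as the rows of y; a quick sanity check (sketch) before partitioning:

## expression columns and survival rows must refer to the same patients, in the same order
stopifnot(identical(colnames(x), as.character(y$sample)))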
## Partition the samples into training and test sets (roughly 50/50)
n <- ncol(x)
k <- rep(1:2, length.out = n)              # alternating 1/2 labels, one per sample
set.seed(123)
i <- sample(k, size = n, replace = FALSE)  # random permutation of the labels
k.test  <- (1:n)[i == 1]                   # samples assigned to the test set
x.train <- x[, -k.test]
t.train <- y$time[-k.test]
s.train <- y$status[-k.test]
z.train <- as.matrix(z[-k.test, ])
x.test  <- x[, k.test]
t.test  <- y$time[k.test]
s.test  <- y$status[k.test]
z.test  <- as.matrix(z[k.test, ])



## Robust likelihood-based gene selection on the training set, with risk factors
fit <- rbsurv(time = t.train, status = s.train, x = x.train, z = z.train,
              alpha = 0.05, method = "efron", max.n.genes = 100,
              n.iter = 20, n.fold = 3, n.seq = 4)
fit$model
write.csv2(fit$model, "fit_model.txt")

The output of fit$model looks like this:

Seq Order   Gene    nloglik AIC Selected
0   1   0   0   106,2   212,4    
110 1   1   63  95,92   193,84  *       
2   1   2   97  91,46   186,92  *       
3   1   3   41  88,26   182,52  *       
4   1   4   58  85,73   179,46  *       
5   1   5   126 83,96   177,92  *       
6   1   6   36  81,31   174,61  *       
7   1   7   105 79,37   172,73  *       
8   1   8   66  77,72   171,43  *       
9   1   9   9   77,16   172,32  *       
10  1   10  89  74,93   169,87  *       
11  1   11  52  70,8    163,59  *       
12  1   12  56  69,32   162,63  *       
13  1   13  40  69,12   164,25   
14  1   14  32  68,81   165,62   
15  1   15  85  67,92   165,84   
16  1   16  106 67,85   167,71   
17  1   17  3   67,04   168,09   
18  1   18  17  66,89   169,78   
19  1   19  4   66,89   171,77   
20  1   20  5   64,71   169,42   
21  1   21  6   64,57   171,13   
22  1   22  7   63,32   170,64   
1   2   0   0   106,2   212,4    
111 2   1   69  98,78   199,55  *       
23  2   2   94  95,05   194,11  *       
31  2   3   49  93,09   192,18  *       
41  2   4   102 90,7    189,41  *       
51  2   5   111 87,55   185,09  *       
61  2   6   68  85,61   183,22  *       
71  2   7   83  84,82   183,64  *       
81  2   8   110 82,03   180,05  *       
91  2   9   18  81,18   180,37  *       
101 2   10  131 80,49   180,98  *       
112 2   11  132 77,25   176,5   *       
121 2   12  108 75,38   174,76  *       
131 2   13  47  74,55   175,11   
141 2   14  37  73,96   175,93   
151 2   15  30  72,68   175,35   
161 2   16  19  72,31   176,63   
171 2   17  81  72,21   178,42   
181 2   18  8   71,8    179,6    
191 2   19  5   71,64   181,28   
201 2   20  15  71,6    183,21   
211 2   21  3   70,87   183,74   
221 2   22  4   70,81   185,63   
2   3   0   0   106,2   212,4    
113 3   1   46  99,89   201,77  *       
24  3   2   38  94,57   193,14  *       
32  3   3   130 93,7    193,41  *       
42  3   4   17  93,09   194,19  *       
52  3   5   129 92,62   195,25  *       
62  3   6   81  92,14   196,27  *       
72  3   7   78  88,81   191,62  *       
82  3   8   106 87,07   190,15  *       
92  3   9   99  86,05   190,09  *       
102 3   10  21  85,44   190,88   
114 3   11  95  85,37   192,75   
122 3   12  25  84,41   192,82   
132 3   13  37  83,26   192,53   
142 3   14  16  82,71   193,42   
152 3   15  98  82,6    195,19   
162 3   16  61  81  194  
3   4   0   0   106,2   212,4    
115 4   1   48  97,91   197,81  *       
25  4   2   85  95,43   194,86  *       
33  4   3   54  94,78   195,56  *       
43  4   4   55  93,14   194,28  *       
53  4   5   61  92,23   194,46  *       
63  4   6   42  90,77   193,54  *       
73  4   7   79  89,46   192,93  *       
83  4   8   8   88,12   192,24  *       
93  4   9   51  87,88   193,75  *       
103 4   10  21  86,74   193,48  *       
116 4   11  53  86,5    195 *       
123 4   12  125 86,46   196,92  *       
133 4   13  123 85,49   196,97  *       
143 4   14  7   84,3    196,6   *       
153 4   15  16  83,72   197,44  *       
163 4   16  67  78,9    189,8   *       
172 4   17  34  78,78   191,57  *       
182 4   18  133 76,76   189,52  *       
192 4   19  101 74,88   187,76  *       
202 4   20  3   74,76   189,52
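Because the Gene column in fit$model refers to genes by their row index in x rather than by name, I map the selected ones back to Ensembl IDs with the id vector saved earlier (a sketch; it assumes the Selected flag is stored as "*" exactly as in the printout above and that the indices follow the original row order of x):

## rows with Selected == "*" are the genes kept by rbsurv
model   <- as.data.frame(fit$model, stringsAsFactors = FALSE)
sel_idx <- unique(as.numeric(model$Gene[model$Selected == "*"]))
sel_idx <- sel_idx[sel_idx > 0]               # drop the null-model rows (Gene == 0), if any
selected_genes <- as.character(id)[sel_idx]   # Ensembl IDs of the rbsurv-selected genes
selected_genes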

Now my questions are:

  1. I ran my analysis on the training set; how can I validate the results on my test set? Should I run the analysis again on the test data and then compare the two gene lists, to see how many genes are selected in both?
  2. To achieve my goal of finding prognostic genes, is it correct to analyze the expression data first with univariate Cox proportional hazards and then with robust likelihood-based modeling with data partitions? As a last step I would like to analyze the genes selected by the likelihood analysis with a multivariate Cox proportional hazards model (roughly as in the sketch at the end of this post). Are all these steps correct and useful?
  3. Are there better ways or R packages to achieve my goal?

I'm really confused about this whole process.
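For concreteness, this is the kind of final step I have in mind for question 2, and roughly how I imagined checking the training-set signature on the test set for question 1 (a sketch I have not run: sel_idx comes from the mapping sketch above, the expression values are assumed to have been read as numeric, and the median split of the risk score is just one possible way to evaluate it):

library(survival)

## multivariate Cox model on the training set, using the rbsurv-selected genes
train_df <- data.frame(time = t.train, status = s.train, t(x.train[sel_idx, ]))
mv_fit   <- coxph(Surv(time, status) ~ ., data = train_df)
summary(mv_fit)

## apply the training-set model to the test set as a risk score (linear predictor)
test_df <- data.frame(time = t.test, status = s.test, t(x.test[sel_idx, ]))
risk    <- predict(mv_fit, newdata = test_df, type = "lp")

## split the test patients by median risk and compare their survival curves
grp <- ifelse(risk > median(risk), "high", "low")
survdiff(Surv(t.test, s.test) ~ grp)

With only about 43 training samples, a multivariate model like this only makes sense for a fairly small set of genes, so sel_idx may need to be reduced further.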
