We have combined four GEO datasets, removed batch effects using ComBat, and extracted the genes we are interested in. Each gene is represented by multiple probes. However, one cohort is missing three of the probes (it has different probes for those genes). We would like to do a survival analysis that combines multiple genes, comparing survival between the high-expression and low-expression groups. We have several questions:
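
| Genes  | A1  | A2  | A3  | B1  | B2  | B3  | C1  | C2  | C3  |
|--------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| Batch1 | NA  | 6.1 | 7.6 | 5.0 | 4.4 | NA  | 6.4 | 6.4 | NA  |
| Batch2 | 5.9 | 5.9 | 8.3 | 5.2 | 5.1 | 5.1 | 6.7 | 6.3 | 6.3 |
| Batch3 | 6.4 | 6.4 | 8.2 | 5.1 | 5.3 | 5.3 | 6.7 | 6.7 | 6.7 |
| Batch4 | 5.6 | 7.1 | 6.3 | 6.3 | 8.1 | 6.5 | 5.4 | 6.0 | 4.9 |
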
- Should we combine the probes of the same gene? If so, which method do you suggest: mean, median, or something else?
- When combining the probes, what should we do about the missing ones? Should we exclude A1, B3, and C3 from the analysis entirely, or should we combine A1, A2, and A3 for Batches 2, 3, and 4 and only A2 and A3 for Batch 1?
- After combining the probes, we would like to see the combined effect of the three genes on survival. To get their combined expression, is it OK to use Avg(A, B, C) + 1/2 SD, or what would you suggest?
- As a next step, how should we define the threshold for high vs. low expression? Is it OK to use a Z-score of the combined three-gene expression to set the threshold, with 0 as the cutoff, so that negative values define low expression and positive values define high expression of the combined genes?
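
To make the questions concrete, here is a minimal Python sketch of one of the options we are considering: average whichever probes are available per gene (so Batch 1 uses only A2 and A3 for gene A), average the three genes into one score, then split at a Z-score of 0. This is only an illustration, not a validated pipeline. It assumes that the rows of the table stand in for samples, that the first letter of a probe name identifies its gene, and that a plain mean is an acceptable summary; the names `collapse_probes`, `gene_level`, `combined`, and `group` are ours.

```python
from statistics import mean, stdev

# Probe-level values copied from the table; None marks a probe that is missing.
probes = {
    "Batch1": {"A1": None, "A2": 6.1, "A3": 7.6, "B1": 5.0, "B2": 4.4,
               "B3": None, "C1": 6.4, "C2": 6.4, "C3": None},
    "Batch2": {"A1": 5.9, "A2": 5.9, "A3": 8.3, "B1": 5.2, "B2": 5.1,
               "B3": 5.1, "C1": 6.7, "C2": 6.3, "C3": 6.3},
    "Batch3": {"A1": 6.4, "A2": 6.4, "A3": 8.2, "B1": 5.1, "B2": 5.3,
               "B3": 5.3, "C1": 6.7, "C2": 6.7, "C3": 6.7},
    "Batch4": {"A1": 5.6, "A2": 7.1, "A3": 6.3, "B1": 6.3, "B2": 8.1,
               "B3": 6.5, "C1": 5.4, "C2": 6.0, "C3": 4.9},
}


def collapse_probes(row):
    """Average the available probes of each gene, skipping missing ones.

    Assumption: the first character of the probe name is the gene (A1 -> A).
    """
    by_gene = {}
    for probe, value in row.items():
        if value is not None:
            by_gene.setdefault(probe[0], []).append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


# One value per gene and row.
gene_level = {name: collapse_probes(row) for name, row in probes.items()}

# Combined three-gene score: plain mean of the gene-level values.
combined = {name: mean(genes.values()) for name, genes in gene_level.items()}

# Z-score of the combined score across rows, then split at 0.
mu = mean(combined.values())
sd = stdev(combined.values())
zscore = {name: (score - mu) / sd for name, score in combined.items()}
group = {name: "high" if z > 0 else "low" for name, z in zscore.items()}

for name in probes:
    print(name, round(combined[name], 3), round(zscore[name], 3), group[name])
```

Note that splitting at Z = 0 is the same as splitting at the mean of the combined score, so the two groups need not be equal in size; a median split would be the alternative if balanced groups matter.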
Thanks in advance,