I have a bunch of HT-seq data from the TCGA and the accompanying clinical metadata. All the patients have cancer. At the end of the surveillance period, some of the patients are dead, others are not. Lets call the deceased patients group A and the surviving patients group B. Also I know when the patients who eventually died actually succumbed to their disease. I know that performing a KS-test for gene set enrichment using either gene set C and gene set D gives me a KS p-value << 0.00001 when the two groups of patients are A: alive at the end of surveillance and B: deceased at the end of surveillance. How to I investigate the hypothesis that pathway C and pathway D are acting together in the patients who do not survive surveillance?
As far as I can see, there are at least a couple approaches to take:
1) Do a KS-test on group A and B using gene set C. Repeat with gene set D. Conclude that pathway C and D are related and associated with patient death based on the algebraic mean of the two p-values.
2) Look only at the patients who died under surveillance and cluster them based on expression of Union(C, D), then compare the survival functions of the resulting clusters.
I'm not sure what packages are best to conduct this analysis, so suggestions are appreciated. Thanks!
Here is an updated strategy based on a recently published paper
0) Take all the patients in Group A
1) Run PCA on the FPKM-normalized counts for transcript IDs mapping to the symbols in Gene Set 1
2) denote PC1 as PC1-GS1
3) Run PCA on the FPKM-normalized counts for transcript IDs mapping to the symbols in Gene Set 2
4) denote PC1 as PC1-GS2
5) Run K-means clustering (let k=3) on patients in Group A, where each patient maps to a point in 2-space corresponding to PC1-GS1, PC2-GS2.
6) Plot survival curves for cluster 1:3
7) Perhaps high PC1-GS2, high PC1-GS1 correlates to worse survival than high PC1-GS2, low PC1-GS1 or low PC1-GS2, high PC1-GS1
Here is the issue:
All the normalized mRNA-seq data that I've downloaded from the GDC shows counts mapped to Ensembl transcript IDs, but the gene sets that I've downloaded from MSigDB in the form of Entrez Gene IDs!
There is apparently a way to map Ens Transcript IDs to Entrez GIDs using the R package biomaRt, however the query that I have written (sample below uses only 5 transcript IDs) is failing.
>library(biomaRt) >ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl") >getBM(attributes = c('entrezgene'), filters = 'ensembl_transcript_id', values = c("ENSG00000270112.3", "ENSG00000167578.15", "ENSG00000273842.1", "ENSG00000078237.5", "ENSG00000146083.10"), mart = ensembl)  entrezgene <0 rows> (or 0-length row.names)
I'm shocked that GDC, being an NCI entity doesn't use Entrez annotation for everything but such is life...