how to perform Kolmogorov-Smirnov statistic in GSEA in R?
2
4
Entering edit mode
9.1 years ago
Mo ▴ 920

I have a list of gens set of interest (assume a list of 20 probes) then I have a matrix showing the rank of genes which are highly expressed to low expression (10 columns and 300 rows representing 10 drugs and 300 probes) This matrix is not values but only probes names.

I want to perform the Kolmogorov-smirnov on it. does anybody know whether it is possible or not? if yes, how can I do it in R?

Here is an example of my geneses data

GSEA microarray statistical-analysis r • 7.1k views
1
Entering edit mode
9.1 years ago
Ahill ★ 1.9k

The KS test (in R, ?ks.test) compares continuous distributions. You cannot run it in a meaningful way on a list of probe names ranked by expression level. If you want to test if pre-defined genesets of size ~20 genes had different ranks after treatment with the 10 drugs, I'd look at something like GSEA-preranked.

1
Entering edit mode
9.1 years ago
mark.ziemann ★ 1.9k

GSEA preranked can be run in weighted or classic mode, where classic mode is basically Kolmogorov-Smirnov test (PMID:16199517). There is a stand alone R-GSEA (http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/R-GSEA_Readme). Here is a way I call GSEA-P from within R with this script I call "GSEAPR.R"

# GSEA Preranked runs the preranked GSEA program via a system call.
# it then reads in the genesets up/down

gseaPreRankedCmd <-function(rank,gmtfile,nperm=1000) {
codex<-paste("gs", gsub("0.","",as.character(runif(1))) ,sep="" )
wrk<-"output/"
rnkfilename<-paste(wrk,codex,".rnk",sep="")
out<-paste(wrk,codex,sep="")
write.table(rank,file=rnkfilename,sep="\t",col.names=F,row.names=F,quote=F)
cmd<- "java -Xmx64G -cp gsea2-2.0.12.jar xtools.gsea.GseaPreranked "
cmd <- paste(cmd,"-gmx",gmtfile)
cmd <- paste(cmd, "-gui false -make_sets true -rnd_seed timestamp -norm meandiv -zip_report false -scoring_scheme classic ")
cmd <- paste(cmd,"-rnk",rnkfilename)
cmd <- paste(cmd,"-out",out)
cmd <- paste(cmd,"-set_max 5000 -set_min 5 -mode Max_probe ")
#cmd <- paste(cmd,"-chip",chipfile)
cmd <- paste(cmd,"-collapse false -nperm ",nperm," -rpt_label 001 -plot_top_x 200")
cat(cmd)
ret<-system(cmd)
gs<- read.table(pipe(paste("cat ",wrk,codex,"/*/gsea_report_for*.xls|grep -v 'RANK AT MAX'",sep="") ) ,sep="\t")
return(gs)
}


You can call this R script like this:

#Load rank
gmtfile <- "/path/to/genesets.gmt"

#Apply function:
source("/path/to/GSEAPR.R")

gseaoutput <- gseaPreRankedCmd(rank, gmtfile ,nperm=1000)

write.table(gseaoutput, file="FILENAMEHERE.TXT", row.names=F, sep="\t", quote=F)


It's important that the gmt file and rank file are formatted perfectly as this is the main source of errors. The other is the chip file specification.

2
Entering edit mode

@mark.ziemann thank you so much Mark but it is more tricky! I tried to make the rnk and gmt. for the rnk I need to have two columns while I only have one which you can find an example here

You can find the geneset file in the main question

I appreciate any further explanation. thanks again

1
Entering edit mode

The classic mode of GSEA only takes into the rank of the genes and the direction, so if you have 16,000 genes and 8000 have +'ive foldchange and the rest -'ive fold change, you can give them a score from +8000 to -8000 and the GSEA should work if you abide by all formats for rnk/gmt/chip files.

1
Entering edit mode

@mark.ziemann is there any example to do so? I have been googling it for a week !!! still could not find any example or information

1
Entering edit mode

The rank file should have this format:

GeneID    Score
A1BG    0.267702
A2M    -0.0917815
A4GALT    0.290421
AAAS    0.465716
AACS    0.127215
AAED1    -0.761705
AAGAB    0.143995`

If you are having problems with formats, you could try the java GUI to troubleshoot them first.

0
Entering edit mode

@mark.ziemann how can I make those score for my probe set ? are they expression or fold change or something like that ? or should I calculate them based on a formula?

0
Entering edit mode

The signed log10 pvalue is a good choice reference.