Question: how to perform Kolmogorov-Smirnov statistic in GSEA in R?
gravatar for Mo
6.0 years ago by
Mo920 wrote:
I have a list of gens set of interest (assume a list of 20 probes) 
then I have a matrix showing the rank of genes which are highly expressed to low expression (10 columns and 300 rows representing 10 drugs and 300 probes) This matrix is not values but only probes names.  

I want to perform the Kolmogorov-smirnov on it. does anybody know whether it is possible or not? if yes, how can I do it in R ?  

here is an example of my geneses data 


ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Mo920
gravatar for Ahill
6.0 years ago by
United States
Ahill1.9k wrote:

The KS test (in R, ?ks.test) compares continuous distributions.  You cannot run it in a meaningful way on a list of probe names ranked by expression level.  If you want to test if pre-defined genesets of size ~20 genes had different ranks after treatment with the 10 drugs, I'd look at something like GSEA-preranked.

ADD COMMENTlink written 6.0 years ago by Ahill1.9k
gravatar for mark.ziemann
6.0 years ago by
mark.ziemann1.3k wrote:

GSEA preranked can be run in weighted or classic mode, where classic mode is basically Kolmogorov-Smirnov test (PMID:16199517). There is a stand alone R-GSEA ( Here is a way I call GSEA-P from within R with this script I call "GSEAPR.R"

# GSEA Preranked runs the preranked GSEA program via a system call.
# it then reads in the genesets up/down

gseaPreRankedCmd <-function(rank,gmtfile,nperm=1000) {
  codex<-paste("gs", gsub("0.","",as.character(runif(1))) ,sep="" )
  cmd<- "java -Xmx64G -cp gsea2-2.0.12.jar xtools.gsea.GseaPreranked "
  cmd <- paste(cmd,"-gmx",gmtfile)
  cmd <- paste(cmd, "-gui false -make_sets true -rnd_seed timestamp -norm meandiv -zip_report false -scoring_scheme classic ")
  cmd <- paste(cmd,"-rnk",rnkfilename)
  cmd <- paste(cmd,"-out",out)
  cmd <- paste(cmd,"-set_max 5000 -set_min 5 -mode Max_probe ")
  #cmd <- paste(cmd,"-chip",chipfile)
  cmd <- paste(cmd,"-collapse false -nperm ",nperm," -rpt_label 001 -plot_top_x 200")
  gs<- read.table(pipe(paste("cat ",wrk,codex,"/*/gsea_report_for*.xls|grep -v 'RANK AT MAX'",sep="") ) ,sep="\t")

You can call this R script like this:

#Load rank
rank <- read.delim("/path/to/rank.rnk")
#Define link to gmtfile
gmtfile <- "/path/to/genesets.gmt"

#Apply function:

gseaoutput <- gseaPreRankedCmd(rank, gmtfile ,nperm=1000)

write.table(gseaoutput, file="FILENAMEHERE.TXT", row.names=F, sep="\t", quote=F)

It's important that the gmt file and rank file are formatted perfectly as this is the main source of errors. The other is the chip file specification.

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by mark.ziemann1.3k

@mark.ziemann thank you so much Mark but it is more tricky !!!! I tried to make the rnk and gmt. for the rnk I need to have two columns while I only have one which you can find an example here 

you can find the geneset file in the main question 

I appreciate any further explanation. thanks again 

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Mo920

The classic mode of GSEA only takes into the rank of the genes and the direction, so if you have 16,000 genes and 8000 have +'ive foldchange and the rest -'ive fold change, you can give them a score from +8000 to -8000 and the GSEA should work if you abide by all formats for rnk/gmt/chip files.

ADD REPLYlink written 6.0 years ago by mark.ziemann1.3k

@mark.ziemann is there any example to do so? I have been googling it for a week !!! still could not find any example or information 

ADD REPLYlink written 6.0 years ago by Mo920

The rank file should have this format:

GeneID    Score
A1BG    0.267702
A2M    -0.0917815
A4GALT    0.290421
AAAS    0.465716
AACS    0.127215
AADAC    -1.00927
AADACL3    0.247068
AADAT    -0.0910159
AAED1    -0.761705
AAGAB    0.143995

If you are having problems with formats, you could try the java GUI to troubleshoot them first.

ADD REPLYlink written 6.0 years ago by mark.ziemann1.3k

@mark.ziemann how can I make those score for my probe set ? are they expression or fold change or something like that ? or should I calculate them based on a formula ? 

ADD REPLYlink written 6.0 years ago by Mo920

The signed log10 pvalue is a good choice reference.

ADD REPLYlink written 6.0 years ago by mark.ziemann1.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2200 users visited in the last hour