Question: GSEA with ranked list
5.9 years ago by
United States
chloe.p.oconnell wrote:



I'm trying to run GSEA on a ranked list of genes. In other words, I'm not using expression data; instead, I'm using a list of genes ranked by the prevalence of variants in those genes in my dataset. I can't figure out how to run GSEA on non-standard input files in either the desktop version or the R version. Every tutorial I can find explains how to run GSEA on expression files that contain expression levels for each individual subject, whereas I already have a ranked list of the genes I'm interested in.


You could just pretend that the rankings are expression levels (you may have to reverse the ordering so that the most prevalently affected gene has the highest number). One of the first calls in any GSEA function is rank(), after all. If that doesn't seem to be working well for you, let me know and I can post some R code.
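
A minimal sketch of the reversal described above, assuming the ranking is stored so that 1 marks the most prevalently affected gene. The gene names, the `variant_rank` vector and the output file name are made up for illustration. The two-column, tab-delimited layout is what GSEA's pre-ranked mode reads as an .rnk file, but check the documentation of the version you use before relying on it.

```r
# Hypothetical input: rank 1 = gene with the most variants (illustrative data).
variant_rank <- c(TP53 = 1, BRCA1 = 2, EGFR = 3, KRAS = 4, MYC = 5)

# Flip the ordering so the most prevalently affected gene gets the highest score.
score <- max(variant_rank) - variant_rank + 1

# Two-column table (gene, score), sorted from highest to lowest score.
ranked <- data.frame(gene = names(score), score = as.numeric(score))
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]

# Tab-delimited, no header or quotes; the file name is an assumption.
out_file <- file.path(tempdir(), "variants.rnk")
write.table(ranked, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

If you already have a real score per gene (for example the number of variants), you can skip the flip and use that score directly.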

written 5.9 years ago by Devon Ryan

Thanks for the help. For some reason, it isn't working correctly. I'm assuming this is my issue, as my coding background is rather weak. I'll keep trying...

written 5.9 years ago by chloe.p.oconnell

You can also try directly doing a ks.test() as lkmklsmn mentioned.

BTW, regardless of the test you end up using, do have a look at the results yourself. Tests like this, which compare distributions, are known to flag results that are statistically significant but likely biologically meaningless.
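
To illustrate the point about significant but meaningless results, here is a toy sketch with simulated data only. The vector names, sample sizes and the 0.1 shift are arbitrary assumptions, not taken from any real dataset. It simply shows why it helps to look at an effect size next to the p-value.

```r
set.seed(42)  # reproducible toy data

# Two large hypothetical score vectors that differ by a tiny shift (0.1 SD).
gene_set   <- rnorm(20000, mean = 0.1)
background <- rnorm(20000, mean = 0)

res <- ks.test(gene_set, background)

# The p-value is driven by sample size as much as by the size of the shift,
# so report an effect size next to it.
p_value     <- res$p.value
median_diff <- median(gene_set) - median(background)

cat("p-value:", format.pval(p_value), "\n")
cat("difference in medians:", round(median_diff, 3), "\n")
```

Plotting both distributions, for example with plot(ecdf(...)), is another quick way to judge whether a difference is large enough to matter.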

written 5.9 years ago by Devon Ryan
5.9 years ago by
United States
lkmklsmn wrote:

The GSEA algorithm is based on the Kolmogorov-Smirnov statistical test. This method tests for a shift in ranks between a set of interest and the background. You would basically be asking the question: is this particular set of genes enriched among the top genes in the ranked list of all genes?

This is fairly simple to do in R. The code would look like this (not run):  

scores <- a numeric vector of your scores (prevalence of variants) of all genes in your dataset

ind <- a numeric vector containing the indices of your gene set in scores

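
A runnable sketch of the idea, using the `scores` and `ind` names from above and the ks.test() call mentioned elsewhere in this thread. The data are simulated, and the choice of a one-sided test and the way the arguments are split are assumptions on my part, not necessarily the exact call the original answer used.

```r
set.seed(1)  # reproducible toy data

# Toy scores for 1000 hypothetical genes (stand-in for variant prevalence).
scores <- rexp(1000)
names(scores) <- paste0("gene", seq_along(scores))

# Hypothetical gene set: 50 genes whose scores we shift upward on purpose.
ind <- sample(seq_along(scores), 50)
scores[ind] <- scores[ind] + 1

# One-sided two-sample KS test: gene set vs. all other genes.
# alternative = "less" asks whether the CDF of the gene set lies below
# that of the background, i.e. whether the set tends to have larger scores.
res <- ks.test(scores[ind], scores[-ind], alternative = "less")
res$p.value
```

With real data, `ind` can be built with which(names(scores) %in% my_gene_set), where `my_gene_set` is a hypothetical character vector of gene names. Note that this is a plain KS test on the score distributions, so it will not reproduce the enrichment score or permutation p-values of the GSEA software.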
modified 5.9 years ago by Istvan Albert • written 5.9 years ago by lkmklsmn