Hypergeometric test of gene set enrichment - Calculate P-value for multiple gene sets in one script?
0
2
Entering edit mode
9.7 years ago

Hi,

I have around 100 different gene sets (as tab-delimited files, no ranking) and would like to see whether my experimental gene list is enriched for any of those gene sets, by hypergeometric testing in R. (This is a custom genome (Toxoplasma gondii), I can't use e.g. PANTHER, DAVID, ReactomePA as far as I can tell). I'm an R novice and am lost as to how to do this test for multiple gene sets. To test one gene set, I've been using (a bit long-winded):

dhyperRandom <- function(myGeneList, myGeneSet, genome){
myRandomGS <- sample( genome,size=length(myGeneSet) )
myX <- length(which(myGeneList %in% myRandomGS))
myM <- length(myRandomGS)
myN <- length(genome) - length(myM)
myK <- length(myGeneList)
return(dhyper(x=myX, m=myM, n=myN, k=myK))
 }
for(i in 1:1000){
 pvalue[i] <- dhyperRandom( myGeneList, myGeneSet, genome )
}

mean(pvalue)

I could go through each gene set individually...but there must be a way of automating this process and reporting the data in a single table. I'd be very grateful for any suggestions!

Natalie

next-gen R GO • 6.8k views
ADD COMMENT
1
Entering edit mode

As you did loop for permutations, write similar loop for gene lists. I would try something like this:

foreach(my.list=allGeneSets, .combine=rbind) %do% {
    myGeneSet <- fread(my.list)
    for(i in 1:1000){
        pvalue[i] <- dhyperRandom(myGeneList, myGeneSet, genome)
    }
}

You should have paths to your datasets in allGeneSets. This is not tested, suggestions to optimize are welcome :-)

ADD REPLY
0
Entering edit mode

Thanks a lot for your suggestion Pgibas! I pasted the random gene set iterations but perhaps should start with just a regular hypergeometric test on each gene set. I tried what you suggested but having some problems. This is the code I'm trying:

#Set paths to files
allGeneSets<-list.files("path", full.names=TRUE)
genome<-t(fread("path"))
myGeneList<-fread("path")
hypervalues<-c()
dhyper_my <- function(myGeneList, myGeneSet, genome){
myX <- length(which(myGeneList %in% myGeneSet))
myM <- length(myGeneSet)
myN <- length(genome) - length(myM)
myK <- length(myGeneList)
return(dhyper(x=myX, m=myM, n=myN, k=myK))
 }
foreach(my.list=allGeneSets, .combine=rbind) %do% {
    myGeneSet <- fread(my.list)
    for(i in 1:73){
        hypervalues[i] <- dhyper_my( myGeneList, myGeneSet, genome )
    }
}

Output:

NULL, In dhyper(x = myX, m = myM, n = myN, k = myK) : NaNs produced

When I do length(which(myGeneList %in% myGeneSet)) separately I get the wrong number... This is what the files look like that I'm reading in:

TGME49_259010    TGME49_285240    TGME49_214440    TGME49_273740    TGME49_288450    TGME49_263130    TGME49_268850    TGME49_238950    TGME49_318580    TGME49_249390    TGME49_273490    TGME49_275568    TGME49_226400    TGME49_262760    TGME49_284190

Sorry if these are stupid questions! Like I said, R newbie :)

ADD REPLY

Login before adding your answer.

Traffic: 837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6