Two different methods for calculating p-value?!
8.9 years ago
Na Sed ▴ 310

I was looking into pathway analysis of a gene list and came across two contradictory approaches, which left me confused.

In https://support.bioconductor.org/p/54827/ the participants discuss enrichment analysis, and Paul Shannon suggests the code below to calculate p-values.

library(KEGGREST)
library(org.Hs.eg.db)

# create a named list, e.g. path:map00010: "Glycolysis / Gluconeogenesis"

pathways.list <- keggList("pathway", "hsa")

# make them into KEGG-style human pathway identifiers
pathway.codes <- sub("path:", "", names(pathways.list))

# for demonstration, just use the first ten pathways
# not all pathways exist for human, so TODO: tryCatch the
# keggGet to be robust against those failures

# subsetting by c(TRUE, FALSE) -- which repeats
# as many times as needed, sorts through some
# unexpected packaging of geneIDs in the GENE element
# of each pw[[n]]

# genes.by.pathway <- sapply(pathway.codes,
#                            function(pwid){
#                              pw <- keggGet(pwid)
#                              pw[[1]]$GENE[c(TRUE, FALSE)]
#                            })
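# load a previously saved genes.by.pathway; Data_path must already point to the data directory
# (file.path is used because paste() would insert a space between directory and file name)
load(file.path(Data_path, "KEGG_Gene_Pathways.RData"))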

all.geneIDs <- keys(org.Hs.eg.db)

# choose one of these for demonstration.  the first (a whole genome random
# set of 100 genes) has very little enrichment, the second, a random set
# from the pathways themselves, has very good enrichment

genes.of.interest <- c("23118", "23119", "23304", "25998", "26001", "51043",
                       "55632", "55643", "55743", "55870", "7314",  "56254",
                       "7316",  "144193","784",   "8837",  "1111",  "84706",
                       "200931","169522","5707",  "5091",  "5901",  "55532",
                       "9777")

# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also

pVals.by.pathway <- sapply(names(genes.by.pathway),
                           function(pathway) {
                             pathway.genes <- genes.by.pathway[[pathway]]
                             white.balls.drawn <- length(intersect(genes.of.interest, pathway.genes))
                             white.balls.in.urn <- length(pathway.genes)
                             total.balls.in.urn <- length(all.geneIDs)
                             black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
                             total.balls.drawn.from.urn <- length(genes.of.interest)
                             dhyper(white.balls.drawn,
                                    white.balls.in.urn,
                                    black.balls.in.urn,
                                    total.balls.drawn.from.urn)
                           })

print(pVals.by.pathway)
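
The comments in the listing mention two details that are easy to miss: the c(TRUE, FALSE) subsetting and the tryCatch TODO. Below is a minimal offline sketch of both, assuming the GENE element alternates gene IDs and descriptions as the comment implies. `fake.gene.field` and `fetch.pathway` are made-up stand-ins for illustration, not KEGGREST objects, and the pathway IDs are placeholders.

```r
# Toy stand-in for pw[[1]]$GENE: IDs and descriptions alternate (assumed layout).
fake.gene.field <- c("1001", "GENEA; description A",
                     "1002", "GENEB; description B",
                     "1003", "GENEC; description C")

# c(TRUE, FALSE) is recycled along the vector, so it keeps elements 1, 3, 5, ...
gene.ids <- fake.gene.field[c(TRUE, FALSE)]
print(gene.ids)

# Hypothetical fetcher that fails for one pathway ID, standing in for keggGet().
fetch.pathway <- function(pwid) {
  if (pwid == "hsa99999") stop("pathway not found")
  list(list(GENE = fake.gene.field))
}

# Wrap the fetch in tryCatch so a failing pathway yields NULL instead of aborting.
safe.genes <- function(pwid) {
  tryCatch(fetch.pathway(pwid)[[1]]$GENE[c(TRUE, FALSE)],
           error = function(e) NULL)
}

genes.by.pathway.demo <- lapply(c(hsa00010 = "hsa00010", hsa99999 = "hsa99999"),
                                safe.genes)

# Drop the pathways whose fetch failed.
genes.by.pathway.demo <- Filter(Negate(is.null), genes.by.pathway.demo)
print(names(genes.by.pathway.demo))
```

With the real keggGet in place of the stub, the same wrapper would probably be enough to skip pathways that do not exist for human.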

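As you know, dhyper returns the hypergeometric point probability. With N genes in total (total.balls.in.urn), M genes in the pathway (white.balls.in.urn), n genes of interest (total.balls.drawn.from.urn) and k genes in the overlap (white.balls.drawn), that is

$$P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$$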

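On the other hand, http://www.tongji.edu.cn/~qiliu/help/help_3_ORA.html gives a complete definition of pathway over-representation analysis and says the p-value is one minus a sum over x of hypergeometric probabilities. In the usual notation (N genes in total, M in the pathway, n genes of interest, k in the overlap) this reads

$$p = 1 - \sum_{x=0}^{k-1} \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$

My question is not about the summation over x, but about subtracting the sum from 1.

Which of the two is correct?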

p-value pathway • 2.5k views
8.9 years ago

It's shown in the notation. You have to read the whole paragraph to understand the difference.

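The first formula is the probability that X = k, i.e. of seeing exactly k genes from the pathway. That's not a hypothesis-test p-value at all. If you want to know how uncommon such an event is, you need the probability of an overlap at least as large as the one observed, P(X ≥ k). You get it by summing the probabilities of all smaller overlaps (X = 0, ..., k - 1) and taking the difference from 100%. That's what's happening in the second formula.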
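
To make the difference concrete, here is a small sketch with made-up numbers (the values of N, M, n and k are assumptions for illustration only). It computes the point probability from dhyper and then the upper tail in three equivalent ways; phyper with lower.tail = FALSE returns P(X > q), hence the k - 1.

```r
# Toy urn (made-up numbers): N genes in total, M in the pathway,
# n genes of interest, k of them found in the pathway.
N <- 20000
M <- 100
n <- 25
k <- 3

# Point probability P(X = k): what dhyper returns.
point.prob <- dhyper(k, M, N - M, n)

# Upper tail P(X >= k), summed directly over k, k + 1, ..., n.
tail.by.sum <- sum(dhyper(k:n, M, N - M, n))

# The same tail as "1 minus the sum over x = 0, ..., k - 1".
tail.by.complement <- 1 - sum(dhyper(0:(k - 1), M, N - M, n))

# The same tail from the distribution function; lower.tail = FALSE gives P(X > k - 1).
tail.by.phyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

print(c(point = point.prob, tail = tail.by.phyper))
```

The three tail values should agree up to floating-point error, and the point probability can never exceed the tail, since it is only the first term of that sum.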

In other words, the subtraction from 1 turns "how likely is a smaller overlap" into "how rare is an overlap this large or larger", and that rareness is the p-value.
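
Applied to the loop from the question, the change would amount to swapping dhyper for the upper tail of phyper. This is a sketch on toy inputs, not a drop-in replacement: `universe`, `genes.by.pathway.toy` and `genes.of.interest.toy` are made up here so that the example runs without KEGG data.

```r
# Toy inputs (made up for illustration; not real KEGG or Entrez data).
universe <- as.character(1:1000)
genes.by.pathway.toy <- list(pathA = as.character(1:50),
                             pathB = as.character(500:560))
genes.of.interest.toy <- as.character(c(1:10, 600:605))

pVals.toy <- sapply(genes.by.pathway.toy, function(pathway.genes) {
  white.balls.drawn <- length(intersect(genes.of.interest.toy, pathway.genes))
  white.balls.in.urn <- length(pathway.genes)
  black.balls.in.urn <- length(universe) - white.balls.in.urn
  total.balls.drawn.from.urn <- length(genes.of.interest.toy)
  # Upper tail P(X >= white.balls.drawn); the "- 1" is needed because
  # lower.tail = FALSE returns P(X > q), not P(X >= q).
  phyper(white.balls.drawn - 1,
         white.balls.in.urn,
         black.balls.in.urn,
         total.balls.drawn.from.urn,
         lower.tail = FALSE)
})

print(pVals.toy)
```

Here pathA shares 10 genes with the toy gene list and gets a very small p-value, while pathB shares none and gets a p-value of 1, which is the expected behaviour of an over-representation test.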
