Question

Two different methods for calculating p-value?!

10.1 years ago

Na Sed ▴ 310

I was looking into pathway analysis for a gene list and came across two conflicting approaches, which left me confused.

In https://support.bioconductor.org/p/54827/, enrichment analysis is discussed, and Paul Shannon suggests the code below to calculate p-values.

library(KEGGREST)
library(org.Hs.eg.db)

# created named list, eg:  path:map00010: "Glycolysis / Gluconeogenesis" 

pathways.list <- keggList("pathway", "hsa")

# make them into KEGG-style human pathway identifiers
pathway.codes <- sub("path:", "", names(pathways.list))

# for demonstration, just use the first ten pathways
# not all pathways exist for human, so TODO: tryCatch the
# keggGet to be robust against those failures

# subsetting by c(TRUE, FALSE) -- which repeats
# as many times as needed, sorts through some
# unexpected packaging of geneIDs in the GENE element
# of each pw[[n]]

# genes.by.pathway <- sapply(pathway.codes,
#                            function(pwid){
#                              pw <- keggGet(pwid)
#                              pw[[1]]$GENE[c(TRUE, FALSE)]
#                            })
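# load the precomputed genes.by.pathway list instead of querying KEGG;
# Data_path is the directory that holds the .RData file
load(file.path(Data_path, "KEGG_Gene_Pathways.RData"))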

all.geneIDs <- keys(org.Hs.eg.db)

# choose one of these for demonstration.  the first (a whole genome random
# set of 100 genes) has very little enrichment, the second, a random set
# from the pathways themselves, has very good enrichment

genes.of.interest <- c("23118", "23119", "23304", "25998", "26001", "51043",
                       "55632", "55643", "55743", "55870", "7314",  "56254",
                       "7316",  "144193","784",   "8837",  "1111",  "84706",
                       "200931","169522","5707",  "5091",  "5901",  "55532",
                       "9777")

# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also

pVals.by.pathway <- sapply(names(genes.by.pathway),
                           function(pathway) {
                             pathway.genes <- genes.by.pathway[[pathway]]
                             white.balls.drawn <- length(intersect(genes.of.interest, pathway.genes))
                             white.balls.in.urn <- length(pathway.genes)
                             total.balls.in.urn <- length(all.geneIDs)
                             black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
                             total.balls.drawn.from.urn <- length(genes.of.interest)
                             dhyper(white.balls.drawn,
                                    white.balls.in.urn,
                                    black.balls.in.urn,
                                    total.balls.drawn.from.urn)
                           })

print(pVals.by.pathway)
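The TODO in the comments above (wrapping keggGet in tryCatch) could look roughly like the following. This is only a sketch: `safe_pathway_genes` and `stub_fetch` are hypothetical names, the stub data is made up, and the retrieval function is passed in as an argument so the example can be run without network access.

```r
# Wrap one pathway lookup so that a failed request yields an empty result
# instead of aborting the whole loop. `fetch` stands in for keggGet.
safe_pathway_genes <- function(pwid, fetch) {
  tryCatch({
    pw <- fetch(pwid)
    # same c(TRUE, FALSE) subsetting as above: keep every other element
    pw[[1]]$GENE[c(TRUE, FALSE)]
  }, error = function(e) {
    message("could not retrieve ", pwid, ": ", conditionMessage(e))
    character(0)
  })
}

# Made-up stand-in for keggGet: one pathway works, the other fails
stub_fetch <- function(pwid) {
  if (pwid == "hsa99999") stop("pathway not found")
  list(list(GENE = c("111", "geneA; description", "222", "geneB; description")))
}

demo.codes <- c(hsa00010 = "hsa00010", hsa99999 = "hsa99999")
genes.by.pathway.demo <- lapply(demo.codes, safe_pathway_genes, fetch = stub_fetch)
print(genes.by.pathway.demo)
```

With KEGGREST loaded, the real call would presumably be `sapply(pathway.codes, safe_pathway_genes, fetch = keggGet, simplify = FALSE)`, which keeps the result a named list even when pathways differ in size.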

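As you know, dhyper returns the hypergeometric point probability

$$P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$$

where N is the total number of genes, M the number of genes in the pathway, n the number of genes of interest, and k the number of genes of interest that fall in the pathway.

On the other hand, http://www.tongji.edu.cn/~qiliu/help/help_3_ORA.html provides a complete definition of pathway over-representation analysis, and says the p-value is calculated as

$$p = 1 - \sum_{x=0}^{k-1} \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$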

My problem is not the summation over x, but the subtraction of that sum from 1.

Which of the two is correct? Does anyone know?

p-value pathway • 2.8k views

updated 2.5 years ago by Ram 45k • written 10.1 years ago by Na Sed ▴ 310

Ram · Accepted Answer · 2015-06-13

It's shown in the notation. You have to read the whole paragraph to understand the difference.

The first formula is the probability that X=k, i.e. of observing exactly k overlapping genes. That's not a hypothesis test p-value at all. If you want to know how uncommon such an event is, you have to count every outcome at least as extreme as the one observed, P(X >= k). The easiest way to get that is to sum the probabilities of the less extreme outcomes (x < k) and take the difference from 100%.

That's what's happening in the second formula: it adds up the probabilities of the smaller overlaps and shows the difference from 100%, so it gives the rareness of the observation rather than its likelihood.
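To make the difference concrete, here is a minimal R sketch. The counts are hypothetical, chosen only for illustration and not taken from the question's data. It computes the point probability from dhyper, the tail probability written as one minus a sum over x, and the same tail probability from phyper.

```r
# Hypothetical counts, for illustration only
white.balls.in.urn <- 50      # genes in the pathway
black.balls.in.urn <- 19950   # genes not in the pathway
total.balls.drawn  <- 25      # genes of interest
white.balls.drawn  <- 3       # genes of interest that are in the pathway

# Point probability P(X = k): this is what dhyper returns
p.point <- dhyper(white.balls.drawn, white.balls.in.urn,
                  black.balls.in.urn, total.balls.drawn)

# Tail probability P(X >= k), written as 1 minus the sum over x < k
p.sum <- 1 - sum(dhyper(0:(white.balls.drawn - 1), white.balls.in.urn,
                        black.balls.in.urn, total.balls.drawn))

# Same quantity from the distribution function:
# lower.tail = FALSE gives P(X > q), so q = k - 1 yields P(X >= k)
p.tail <- phyper(white.balls.drawn - 1, white.balls.in.urn,
                 black.balls.in.urn, total.balls.drawn,
                 lower.tail = FALSE)

print(c(point = p.point, one.minus.sum = p.sum, upper.tail = p.tail))
```

The last two values agree up to floating-point error, and both are at least as large as the point probability. The phyper form is probably the safer one in practice, since `0:(k - 1)` does not behave as intended when the overlap k is 0.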