Question: Which statistical test should be used to test for enrichment of gene lists?
3.0 years ago by
India, Centre for Cellular and Molecular Biology
Ashutosh10 wrote:

In a typical ChIP-Seq experiment, we found peaks for a transcription factor "ABC" on 590 genes, of which 242 are classified as "ORFs". The genome contains 6226 genes in total, 5420 of them "ORFs". Are the "ABC"-bound genes enriched for ORFs, or could the association of "ABC" with ORFs have arisen by chance? Which statistical test should be used? Can it be done in R, and if so, how? Is there any other way to do this?

I learned from a forum that the usual test for enrichment of gene lists is a hypergeometric test or, equivalently, a one-sided Fisher's exact test. Although I am not very familiar with R, I followed other examples and ran Fisher's Exact Test for count data. This is my output:

> fisher.test(matrix(c(242,5178,348,458),nrow=2,ncol=2),alternative="greater")

 Fisher's Exact Test for Count Data

data:  matrix(c(242, 5178, 348, 458), nrow = 2, ncol = 2)
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.05224986        Inf
sample estimates:
odds ratio


My analysis, if correct, suggests that factor "ABC" is present on ORFs only by chance. If the analysis is right, what should the conclusion be? Is the above conclusion right? Please help.


3.0 years ago by
United States
seidel6.1k wrote:

Yes, you can use Fisher's exact test, the hypergeometric test, or a variety of other methods to test for enrichment. But for the moment, forget statistics: just look at the data and re-examine your conclusion. If 5420 of 6226 genes are classified as ORFs and you have 590 bound genes, how many of those would you expect to be ORFs? 590 * 5420/6226 ≈ 514. How many do you actually observe? 242. So you see a depletion of ORFs in your data set.

You might turn the question around and ask: are your peaks "enriched" for non-ORFs? The number you would expect by chance is about 76 (590 * 806/6226), yet you observe 348. If you ask the question this way, your matrix becomes matrix(c(348, 458, 242, 5178), 2, 2), and the p-value drops, because the probability of seeing 348 or more non-ORFs by chance is very low. The way you asked the question before was: what is the probability of seeing 242 or more bound ORFs by chance? That gave you a p-value of 1, because the number expected by chance is about 514.
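To make the arithmetic above concrete, here is a minimal sketch in R. The counts are the ones from the question; the variable names (total_genes, bound_orfs, tab and so on) are only illustrative choices, not anything defined in the thread.

```r
# Counts taken from the question
total_genes <- 6226   # genes in the genome
total_orfs  <- 5420   # genes classified as ORFs
bound       <- 590    # genes with an "ABC" peak
bound_orfs  <- 242    # bound genes that are ORFs

# Expected counts among the bound genes if binding ignored ORF status
expected_orfs    <- bound * total_orfs / total_genes
expected_nonorfs <- bound * (total_genes - total_orfs) / total_genes
round(c(ORF = expected_orfs, nonORF = expected_nonorfs), 1)

# Labeled 2x2 table, filled column by column:
# rows = bound / not bound, columns = non-ORF / ORF
bound_nonorfs <- bound - bound_orfs
tab <- matrix(
  c(bound_nonorfs, (total_genes - total_orfs) - bound_nonorfs,
    bound_orfs,    total_orfs - bound_orfs),
  nrow = 2,
  dimnames = list(binding = c("bound", "not_bound"),
                  class   = c("nonORF", "ORF"))
)
tab

# One-sided test: are the bound genes enriched for non-ORFs?
res_nonorf <- fisher.test(tab, alternative = "greater")
res_nonorf$p.value
```

Giving the table dimnames is optional, but it makes it much harder to put a count in the wrong cell, which is the easiest mistake to make with this kind of test.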

Play with the numbers, and phrase your hypothesis, to get a sense of how it works.
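One way to play with the numbers is sketched below. The helper enrichment_test() and its argument names are hypothetical (they are not part of R or of this thread); it only builds the same 2x2 table from four counts and passes it to fisher.test() with the alternative you choose.

```r
# Hypothetical helper (an illustration only): build the 2x2 table from four
# counts and run Fisher's exact test under a chosen alternative.
enrichment_test <- function(hits_in_class, hits_total, class_total, universe,
                            alternative = "greater") {
  hits_outside <- hits_total - hits_in_class
  tab <- matrix(
    c(hits_in_class, class_total - hits_in_class,               # in class
      hits_outside,  (universe - class_total) - hits_outside),  # outside class
    nrow = 2
  )
  fisher.test(tab, alternative = alternative)
}

# Same data, three differently phrased hypotheses about ORFs
alternatives <- c("greater", "less", "two.sided")
p_values <- sapply(alternatives, function(alt) {
  enrichment_test(242, 590, 5420, 6226, alternative = alt)$p.value
})
p_values
```

With "greater" you are asking about enrichment of ORFs among the bound genes, with "less" about depletion, and with "two.sided" about any departure from what chance would give.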


Thank you very much for your suggestions. I will do as advised.

Reply written 3.0 years ago by Ashutosh
3.0 years ago by
Freiburg, Germany
Devon Ryan68k wrote:

Your analysis simply shows that there's no enrichment for ORFs among the genes bound by the transcription factor (in fact, there's a significant depletion). Without knowing anything about your model organism or the transcription factor in question, it's impossible to say any more than that.
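The depletion can be checked directly. The sketch below, using only base R and the counts from the question, computes the lower tail of the hypergeometric distribution with phyper() and compares it with the one-sided Fisher's exact test using alternative = "less". The variable names are illustrative assumptions.

```r
# phyper(q, m, n, k): q = ORFs among the bound genes, m = ORFs in the genome,
# n = non-ORFs in the genome, k = number of bound genes
orfs       <- 5420
non_orfs   <- 6226 - 5420
bound      <- 590
bound_orfs <- 242

# Depletion: P(X <= 242) when 590 genes are drawn at random
p_hyper <- phyper(bound_orfs, orfs, non_orfs, bound)

# The same question as a one-sided Fisher's exact test on the original matrix
tab <- matrix(c(242, 5178, 348, 458), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "less")$p.value

# Both p-values are tiny, so compare them on the log scale
all.equal(log(p_hyper), log(p_fisher))

# Sample (cross-product) odds ratio; fisher.test() reports a conditional
# maximum likelihood estimate instead, which can differ slightly
sample_or <- (242 * 458) / (5178 * 348)
sample_or
```

An odds ratio well below 1 together with a very small lower-tail p-value is what "significant depletion" means here.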


Thanks, Devon. My model organism is yeast, and I am looking at histone chaperone occupancy on ORFs.

Reply written 3.0 years ago by Ashutosh