Question: Which statistical test should be used to test for enrichment of gene lists?
1
6.3 years ago by
Ashutosh10
India, Centre for Cellular and Molecular Biology
Ashutosh10 wrote:

In a typical ChIP-Seq experiment, we found peaks for a transcription factor "ABC" on 590 genes. Of these 590 genes, 242 are classified as "ORFs". The genome contains 5420 "ORFs" out of 6226 genes in total. Are the "ABC"-bound genes enriched for ORFs, or is the association of "ABC" with ORFs just due to chance? Which statistical test should be used? Can it be done in R, and if so, how? Is there any other way to do this?

I learned from a forum that the usual test for enrichment of gene lists is a hypergeometric test or, equivalently, a one-sided Fisher's exact test. Although I am not very familiar with R, I followed other examples and ran Fisher's exact test for count data in R. My output looks like this:

    > fisher.test(matrix(c(242,5178,348,458),nrow=2,ncol=2),alternative="greater")

        Fisher's Exact Test for Count Data

    data:  matrix(c(242, 5178, 348, 458), nrow = 2, ncol = 2)
    p-value = 1
    alternative hypothesis: true odds ratio is greater than 1
    95 percent confidence interval:
     0.05224986        Inf
    sample estimates:
    odds ratio
    0.06156519

My analysis, if correct, suggests that factor "ABC" is present on ORFs only by chance. Is the analysis right, and if so, is this the correct conclusion to draw from it? Please help.

chip-seq R significance-test • 9.8k views
modified 6.3 years ago by seidel7.1k • written 6.3 years ago by Ashutosh10
11
6.3 years ago by
seidel7.1k
United States
seidel7.1k wrote:

Yes, you can use Fisher's exact test, the hypergeometric test, or a variety of other methods to test for enrichment. But for the moment, forget the statistics, just look at the data, and re-examine your conclusion. If 5420 of 6226 genes are classified as ORFs and you have 590 bound genes, how many of those would you expect to be ORFs? `590 * 5420/6226 ≈ 514`. How many do you actually observe? 242. So you see a depletion of ORFs in your data set.

You might turn the question around and ask: are your peaks "enriched" for non-ORFs? The number you would expect by chance is about 76 (`590 * 806/6226`), yet you observe 348. If you ask the question this way, you change your matrix to `matrix(c(348, 458, 242, 5178), 2, 2)`, and the p-value drops, because the probability of seeing 348 or more non-ORFs by chance is very low. The way you asked the question before was: what is the probability of seeing 242 or more ORFs bound by chance? That gave you a p-value of 1, because the number expected by chance is about 514, far above 242.

Play with the numbers, and rephrase your hypothesis, to get a sense of how it works.
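One way to play with the numbers is sketched below. It uses only the counts from the question; the variable names and the labelled 2x2 layout (rows = bound / not bound, columns = ORF / non-ORF) are my own choices for illustration, not something the original poster used. It computes the counts expected by chance and then runs the one-sided test both ways round.

```r
# Counts taken from the question
total_genes <- 6226   # all genes in the genome
total_orfs  <- 5420   # genes classified as ORFs
bound       <- 590    # genes with an "ABC" peak
bound_orfs  <- 242    # bound genes that are ORFs

# Expected counts among the bound genes if binding ignored the ORF label
expected_orfs    <- bound * total_orfs / total_genes
expected_nonorfs <- bound * (total_genes - total_orfs) / total_genes

# 2x2 table: rows = bound / not bound, columns = ORF / non-ORF
tab <- matrix(
  c(bound_orfs,              bound - bound_orfs,
    total_orfs - bound_orfs, (total_genes - total_orfs) - (bound - bound_orfs)),
  nrow = 2, byrow = TRUE,
  dimnames = list(binding = c("bound", "not_bound"),
                  class   = c("ORF", "non_ORF"))
)

# Are ORFs over-represented among the bound genes?
p_orf_enriched <- fisher.test(tab, alternative = "greater")$p.value

# Are non-ORFs over-represented among the bound genes? (swap the columns)
p_nonorf_enriched <- fisher.test(tab[, c("non_ORF", "ORF")],
                                 alternative = "greater")$p.value

print(round(c(expected_orfs = expected_orfs,
              expected_nonorfs = expected_nonorfs), 1))
print(tab)
print(c(p_orf_enriched = p_orf_enriched,
        p_nonorf_enriched = p_nonorf_enriched))
```

Labelling the rows and columns makes it harder to transpose the table by accident, and swapping the columns is all it takes to reverse the direction of the question.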

Thank you very much for your suggestions. I will do the needful.

1
6.3 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

Your analysis simply shows that there's no enrichment for ORFs in the binding sites of the transcription factor (in fact, there's a significant depletion). Not knowing anything about your model organism or the transcription factor in question, it'd be impossible to say any more than that.
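To put a number on the depletion, a minimal sketch is to rerun the test from the question with `alternative = "less"`. The matrix is the one from the question; the `phyper` call is added here only to illustrate the hypergeometric equivalence mentioned in the question, and the variable names are illustrative.

```r
# Same 2x2 table as in the question:
# columns = ORF / non-ORF, rows = bound / not bound
tab <- matrix(c(242, 5178, 348, 458), nrow = 2, ncol = 2)

# One-sided Fisher's exact test for depletion (odds ratio less than 1)
p_fisher <- fisher.test(tab, alternative = "less")$p.value

# Equivalent hypergeometric lower tail: P(X <= 242) when drawing
# 590 genes from a genome with 5420 ORFs and 806 non-ORFs
p_hyper <- phyper(242, m = 5420, n = 6226 - 5420, k = 590)

print(c(fisher_less = p_fisher, phyper_lower_tail = p_hyper))
```

Both calls should report the same, very small p-value, which is the probability of seeing 242 or fewer ORFs among 590 bound genes by chance.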