Question: P Value Or Statistical Significance Of Real Peak Compared To Random Peak Overlaps
12
7.9 years ago by
biorepine1.5k
Spain
biorepine1.5k wrote:

Dear Biostars,

This might be one of the most obvious statistics-related questions in high-throughput sequencing data analysis. The question is: how can one calculate the enrichment of real versus random region/peak overlaps?

For example: is the overlap between sox2 peaks and oct4 peaks statistically significant or not?

```
My total no. of sox2 peaks = 4000
The no. of sox2 peaks that overlap oct4 = 2500
The no. of random sox2 peaks that overlap oct4 = 20
```

I agree that the above example doesn't even need a statistical test to confirm the enrichment of 2500 over 20. But how can one statistically show this significance of enrichment as a p-value per se?

I was doing something like this. Do you think it is correct? If not, could you please suggest a better way? Many thanks in advance!

```
= log(((no. of sox2 peaks that overlap oct4 - no. of random sox2 peaks that overlap oct4) / total no. of sox2 peaks) * 100)
= log(((2500 - 20) / 4000) * 100)
```
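For reference, the proposed score can be computed directly, but note that it yields a log-transformed overlap percentage, i.e. a descriptive enrichment score rather than a p-value. A minimal Python sketch, assuming the log is base 10 (the post does not specify):

```python
import math

# The poster's proposed score: percentage of real sox2 peaks overlapping
# oct4 after subtracting the random-overlap count, then log-transformed.
# Assuming "log" means log10 here (the original post does not say).
real_overlap = 2500
random_overlap = 20
total_peaks = 4000

score = math.log10(((real_overlap - random_overlap) / total_peaks) * 100)
print(score)  # log10(62.0), about 1.79
```

The score quantifies the size of the enrichment, but it carries no notion of sampling variability, which is what a significance test would add.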
chip-seq • 5.6k views
modified 4.2 years ago by i.sudbery10k • written 7.9 years ago by biorepine1.5k

Look at the KS test: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

4
7.9 years ago by
Ian5.7k
University of Manchester, UK
Ian5.7k wrote:

This is a very important question! One that I do not think has been satisfactorily solved yet!

It has been asked before on Biostars as: A: Annotating chip seq: how to get enrichment over random background and A: How do you calculate if two sets of genomic regions overlap significantly?

I am still interested in the results of the Genomic Hyperbrowser. But it is not a trivial exercise to determine what the best null model is.

I know the following does not address the statistical analysis, but I think it is important nonetheless:

One of the most important aspects of your question is where the random sequences come from. I don't think you stated the origin of yours. I am currently favouring the use of bedtools shuffle, which takes your genome coordinates and shuffles them within (or, if you choose, across) the same chromosome, while excluding undesirable regions. By undesirable I mean regions of the genome that cannot be sequenced reliably (low mappability) or that do not contain good sequence (gaps), both of which I obtain from the UCSC Genome Browser.
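To make this shuffle-based null concrete, here is a toy Monte Carlo sketch (not Ian's actual workflow): peaks are hypothetical (start, end) tuples on a single chromosome, and the empirical p-value is the fraction of shuffles whose overlap count reaches the observed one. A real analysis would shuffle with bedtools shuffle (e.g. -chrom to stay on the same chromosome, -excl for gaps/mappability) rather than this simplified uniform placement.

```python
import random

def count_overlaps(peaks, targets):
    """Number of peaks overlapping at least one target (naive O(n*m) scan)."""
    return sum(
        any(start < t_end and t_start < end for t_start, t_end in targets)
        for start, end in peaks
    )

def shuffle_peaks(peaks, chrom_len, rng):
    """Re-place each peak uniformly at random on the chromosome, keeping its width."""
    shuffled = []
    for start, end in peaks:
        width = end - start
        new_start = rng.randrange(0, chrom_len - width)
        shuffled.append((new_start, new_start + width))
    return shuffled

def empirical_p(peaks, targets, chrom_len, n_iter=1000, seed=0):
    """Fraction of shuffles with at least as many overlaps as observed.
    Add-one smoothing keeps the p-value from being exactly 0."""
    rng = random.Random(seed)
    observed = count_overlaps(peaks, targets)
    hits = sum(
        count_overlaps(shuffle_peaks(peaks, chrom_len, rng), targets) >= observed
        for _ in range(n_iter)
    )
    return (hits + 1) / (n_iter + 1)
```

The smallest p-value this can report is 1/(n_iter + 1), so the number of shuffles bounds the resolution of the test.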

I look forward to seeing whether anyone offers a good solution to this question!

Yes, I used bedtools shuffle across all the chromosomes of mm9, and I would love to see you comment on my suggested method too, as it showed what I anticipated.

I didn't ignore your method, I just don't feel qualified to comment :)

3
7.9 years ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

Giving statistical advice is a treacherous business, as no problem is ever as simple as one thinks; moreover, the person asking the question almost never provides a correct and full description of the problem. I have noticed that a statistician will never give you an answer straight away; they will say things like "let's talk about it more", and then ask a whole bunch of questions, some of which are really hard to answer.

In general I like to think in terms of problem categories rather than an exact solution to one particular problem. Your data sounds like a contingency-table type, so perhaps a chi-square or Fisher exact test is appropriate for testing the difference in proportions.
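As a sketch of the contingency-table framing: put the counts from the question into a 2x2 table (real vs. shuffled peaks, overlapping vs. not) and compute the one-sided Fisher exact p-value, which is just a hypergeometric tail sum. This is a from-scratch illustration; in practice you would call R's fisher.test or an equivalent library routine.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    p = P(top-left cell >= a) under the hypergeometric null."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / denom

# Counts from the question: 2500/4000 real sox2 peaks overlap oct4,
# versus 20/4000 shuffled peaks.
p = fisher_one_sided(2500, 1500, 20, 3980)
# p is vanishingly small here -- far beyond any conventional threshold.
```

For comparing sox2 against several TFs at once, one common route is a separate 2x2 test per TF followed by multiple-testing correction, rather than a single larger table.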

1

You might be right. I have seen Bing Ren's paper (http://www.ncbi.nlm.nih.gov/pubmed/22763441) use the Fisher exact test in their overlap analysis. However, if I want to compare sox2 and random sox2 peaks against more than one TF's peaks (for example: oct4, klf4, p300 and cmyc peaks), the Fisher test won't work, I guess. Anyway, I would love to see you comment on my suggested method too, as it showed what I anticipated.

R has a good (I think) implementation of the Fisher test. You read in a four-column table (overlap / no-overlap counts for both sets, e.g. test.csv) and can run the following:

```
table <- read.csv("test.csv")
fisherList <- apply(table, 1, FUN = function(x)
  fisher.test(matrix(x, nr = 2), workspace = 1000000,
              alternative = "two.sided")$p.value)
write(fisherList, file = "test_results.txt", sep = "\n")
```

Apparently the Barnard test is better, but I have not tried it in R yet.

3
7.9 years ago by
Alastair Kerr5.3k
Manchester/UK/Cancer Biomarker Centre at CRUK-MI
Alastair Kerr5.3k wrote:

The data you describe lends itself to a likelihood-ratio test, e.g. chi-squared. However, some more thought should be applied to defining a proper null hypothesis. Even then, you need to consider having biological replicates.
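To illustrate the likelihood-ratio option on the question's numbers: the G statistic for a 2x2 table is 2 * sum(O * ln(O/E)), compared against a chi-squared distribution with 1 degree of freedom. A self-contained sketch (a real analysis would use a stats package and, as noted above, a carefully chosen null and replicates):

```python
import math

def g_test_2x2(a, b, c, d):
    """Likelihood-ratio (G) test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,  # expected count under independence
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Chi-squared survival function with 1 df: P(X > g) = erfc(sqrt(g / 2))
    p = math.erfc(math.sqrt(g / 2))
    return g, p

g, p = g_test_2x2(2500, 1500, 20, 3980)  # counts from the question
```

With counts this lopsided the G-test and Fisher's exact test agree that the association is overwhelming; they diverge mainly when expected counts are small.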

Have a look at Rory Stark's R-package DiffBind.

How did that "other method in pre-publication" go?

0
4.2 years ago by
i.sudbery10k
Sheffield, UK
i.sudbery10k wrote:

Have a look at the GAT software.