P Value Or Statistical Significance Of Real Peak Compared To Random Peak Overlaps
4
12
11.1 years ago
biorepine ★ 1.5k

Dear Biostars,

This may be one of the most basic statistical questions in high-throughput sequencing data analysis: how can one calculate the enrichment of real versus random region/peak overlaps?

For example: is the overlap between sox2 peaks and oct4 peaks statistically significant or not?

Total number of sox2 peaks = 4000
Number of sox2 peaks that overlap oct4 = 2500
Number of random sox2 peaks that overlap oct4 = 20

I agree that the above example doesn't even need a statistical test to confirm the enrichment of 2500 over 20. But how can one show the significance of this enrichment statistically, as a p-value?

I was doing something like this. Do you think it is correct? If not, could you please suggest a better way? Many thanks in advance!

= log(((number of sox2 peaks that overlap oct4 - number of random sox2 peaks that overlap oct4) / total number of sox2 peaks) * 100)
= log(((2500 - 20) / 4000) * 100)
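
For concreteness, the quantity above can be evaluated with a few lines of Python. This is only a sketch: the post does not say which log base is meant, so both the natural log and log10 are shown, and the function name `overlap_score` is made up for illustration.

```python
import math

def overlap_score(real_overlap, random_overlap, total_peaks, log=math.log):
    """Evaluate log(((real - random) / total) * 100) as written in the post."""
    percent_excess = (real_overlap - random_overlap) / total_peaks * 100
    return log(percent_excess)

# Numbers from the post; the log base is an assumption (not stated there).
score_ln = overlap_score(2500, 20, 4000)                     # natural log
score_log10 = overlap_score(2500, 20, 4000, log=math.log10)  # base 10
print(score_ln, score_log10)
```

Note that the result is a score on a log scale, not a probability, so it cannot be read as a p-value directly.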
chip-seq • 7.3k views
4
11.1 years ago
Ian 6.0k

This is a very important question! One that I do not think has been satisfactorily solved yet!

It has been asked before on Biostars as: A: Annotating chip seq: how to get enrichment over random background and A: How do you calculate if two sets of genomic regions overlap significantly? .

I am still interested in the results of the Genomic Hyperbrowser. But it is not a trivial exercise to determine what the best null model is.

I know the following does not address the statistical analysis, but I think it is important nonetheless:

One of the most important aspects of your question is where the random sequences are coming from. I don't think you stated the origin of yours. I am currently favouring bedtools shuffle, which takes your genome coordinates and shuffles them within the same chromosome (or not, if you choose) while excluding undesirable regions. By undesirable I mean regions of the genome that cannot be sequenced (mappability) or do not contain good sequence (gaps), both of which I obtain from the UCSC Browser.
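
One way to turn repeated shuffles into a p-value is an empirical (permutation) test: shuffle many times, count how often the shuffled overlap is at least as large as the real one, and report (count + 1) / (shuffles + 1). The sketch below is a toy, pure-Python stand-in for that idea, not a bedtools wrapper. It assumes a single chromosome, half-open intervals, no excluded regions, and made-up coordinates; all function names are hypothetical.

```python
import random

def count_overlaps(peaks, targets):
    """Count peaks (start, end) that overlap at least one target interval."""
    return sum(
        any(p_start < t_end and t_start < p_end for t_start, t_end in targets)
        for p_start, p_end in peaks
    )

def shuffle_peaks(peaks, chrom_length, rng):
    """Move each peak to a uniformly random position, keeping its length."""
    shuffled = []
    for start, end in peaks:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled

def empirical_p_value(peaks, targets, chrom_length, n_shuffles=1000, seed=0):
    """Return (observed overlap count, one-sided empirical p-value)."""
    rng = random.Random(seed)
    observed = count_overlaps(peaks, targets)
    as_extreme = sum(
        count_overlaps(shuffle_peaks(peaks, chrom_length, rng), targets) >= observed
        for _ in range(n_shuffles)
    )
    # The +1 terms keep the estimate from ever being exactly zero.
    return observed, (as_extreme + 1) / (n_shuffles + 1)

# Toy data (invented for illustration): 50 targets, 40 peaks, 30 of them on targets.
chrom_length = 1_000_000
targets = [(i * 10_000, i * 10_000 + 200) for i in range(50)]
peaks = [(i * 10_000 + 100, i * 10_000 + 300) for i in range(30)]
peaks += [(600_000 + i * 5_000, 600_000 + i * 5_000 + 200) for i in range(10)]

observed, p_value = empirical_p_value(peaks, targets, chrom_length)
print(observed, p_value)
```

With this approach the smallest reportable p-value is 1 / (n_shuffles + 1), so a very strong enrichment needs many shuffles before the number means much. The choice of null model (what may be shuffled where) still matters more than the arithmetic.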

I look forward to seeing whether anyone offers a good solution to this question!

ADD COMMENT
0

Yes, I used bedtools shuffle with all the chromosomes of mm9. I would love to see you guys also comment on my suggested method, as it was showing what I anticipated.

ADD REPLY
0

I didn't ignore your method, I just don't feel qualified to comment :)

ADD REPLY
3
11.1 years ago

Giving statistical advice is a treacherous business, as no problem is ever as simple as one thinks. Moreover, the person asking the question almost never provides a correct and full description of the problem. I have noticed that a statistician will never give you an answer straight away. They will say things like "let's talk about it more", and then they ask a whole bunch of questions, some of which are really hard to answer.

In general I like to think in terms of problem categories rather than an exact solution to one particular problem. Your data sounds like a contingency table type, so perhaps a Chi-square or Fisher exact test is appropriate to test for differences in the proportions.
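
As a rough illustration of the contingency-table idea, here is a minimal Python sketch of a one-sided Fisher exact test using only the standard library. It assumes the random set also contains 4000 peaks (the question does not say so explicitly), giving the table [[2500, 1500], [20, 3980]] for overlap / no overlap in real versus random peaks. The function name is hypothetical, and the result is returned as log10(p) because the p-value itself is far too small for a float.

```python
import math

def fisher_one_sided_log10_p(a, b, c, d):
    """log10 of the one-sided (enrichment) Fisher exact p-value
    for the 2x2 table [[a, b], [c, d]], via the hypergeometric tail."""
    row1 = a + b            # size of the first group (real peaks)
    col1 = a + c            # total number of overlapping peaks
    total = a + b + c + d
    k_max = min(row1, col1)
    # Exact integer sum over tables at least as extreme as the observed one.
    tail = sum(
        math.comb(col1, k) * math.comb(total - col1, row1 - k)
        for k in range(a, k_max + 1)
    )
    return math.log10(tail) - math.log10(math.comb(total, row1))

# Rows: real sox2 peaks, random sox2 peaks (random total of 4000 is assumed).
# Columns: overlap oct4, no overlap.
log10_p = fisher_one_sided_log10_p(2500, 1500, 20, 3980)
print(log10_p)
```

This treats every peak as an independent observation, which is itself an assumption worth questioning for clustered genomic regions.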

ADD COMMENT
1

You might be right. I have seen Bing Ren's paper (http://www.ncbi.nlm.nih.gov/pubmed/22763441) using the Fisher exact test in their overlap analysis. However, if I want to compare sox2 and random sox2 peaks with the peaks of more than one TF (for example oct4, klf4, p300 and cmyc), the Fisher test won't work, I guess. Anyway, I would love to see you guys also comment on my suggested method, as it was showing what I anticipated.

ADD REPLY
0

R has a good (I think) implementation of the Fisher test. You put the counts in a four-column table (overlap / no-overlap in both sets, e.g. test.csv) and can run the following:

table <- read.csv("test.csv")
# One Fisher test per row; each row holds the four cells of a 2x2 table
fisherList <- apply(table, 1, FUN=function(x) fisher.test(matrix(x, nrow=2), workspace=1000000, alternative="two.sided")$p.value)
write(fisherList, file="test_results.txt", sep="\n")

Apparently Barnard's test is better, but I have not tried it in R yet.

ADD REPLY
3
11.1 years ago

The data that you describe lends itself to a likelihood ratio test, e.g. Chi-Squared. However some more thought should be applied to defining a proper null hypothesis. Even then, you need to consider having biological replicates.
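
To make the Chi-squared suggestion concrete, here is a minimal Python sketch of Pearson's chi-squared test on a 2x2 table, without continuity correction and using only the standard library. It assumes the random set has the same 4000 peaks as the real set, which the question does not state explicitly, and it ignores the null-hypothesis and replicate caveats raised above. The function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: real sox2 peaks, random sox2 peaks (random total of 4000 is assumed).
# Columns: overlap oct4, no overlap.
statistic, p_value = chi_square_2x2(2500, 1500, 20, 3980)
print(statistic, p_value)
```

For counts this lopsided the p-value underflows to 0.0 in floating point, so the statistic itself is the more informative number to report.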

Have a look at Rory Stark's R-package DiffBind.

ADD COMMENT
0

How did that "other method in pre-publication" go?

ADD REPLY
0
7.4 years ago

Have a look at the GAT software.

ADD COMMENT
