Question: Association between bed files - statistical significance
gravatar for justin.nelligan
5.7 years ago by
justin.nelligan70 wrote:

I have 2 sets of bed files from ChIP-seq, A and B, and a certain number of peaks overlap. Now I want to put some kind of enrichment score / p-value on that overlap - if only 2 peaks overlap and A and B contain both 10 000 peaks, it's not really significant, while if 90 peaks overlap, A is 10000 peaks and B is 100, it's more significant.

The classical way to do this would be to do a hypergeometric test - all the variables of the test are easily filled in (number of draws, number of successes, number of 'failures') except one, the total population number. In this scenario it would be something similar to 'the total number of peaks one could potentially draw from the genome', which is impossible to estimate, and could be biased (regions that don't chip well, unmappable regions, etc).

What's the best way to put a 'score' on an overlap between A and B then? The ultimate goal is to display some kind of heatmap, as I'm doing the comparison between many A's and many B's.

chip-seq • 5.7k views
ADD COMMENTlink modified 5.1 years ago by bernatgel2.6k • written 5.7 years ago by justin.nelligan70
gravatar for brentp
5.7 years ago by
Salt Lake City, UT
brentp23k wrote:

Something to this effect is available in the latest release of bedtools as bedtools fisher 

It uses bases, rather than counts of peaks. Currently, it doesn't allow you to exclude regions like centromeres or "quiescient" regions that are unlikely to harbor overlaps and I have a PR with that as a work in progress, but it won't be ready for a while.


I have a ticket open to allow the use of counts of intervals, but it will suffer from determining the denominator as you suggest. I think that the total number of possible intervals could be the number of times an interval of the average size could fit into the genome.

For your stuff, you may want to check out this paper: and the associated software.

ADD COMMENTlink written 5.7 years ago by brentp23k

Love the idea of the per-base Fisher test, I'll probably give that a try. I'm already using Intervalstats to calculate the per-peak p-value between A and B - my problem is that in their paper they suggest a summary statistic being 'number of significant peaks in A versus B' divided by 'total number of peaks in A'. All of that is fine in the case where you can compare one set of bed files against themselves, but I'm doing a pairwise comparison between all beds in 'set of bed A' versus 'set of bed B', and depending on the order (A vs B or B vs A) you can have wildly different summary statistics. One example would be if you have 50 significant peaks between A and B, but A has 5000 peaks while B has 100 - the summary statistic changes from 0.01 to 0.5 depending on the order. I was trying to find a way to quantify the relationship between the two in a single value.

ADD REPLYlink written 5.7 years ago by justin.nelligan70

The asymmetry in the IntervalStats algorithm can be useful for biological interpretation. In the example you give, you might imagine A to be something like PolII (found at many places) and B to be an activating transcription factor (found at fewer places but typically with A). Maybe the single metric you're looking for can be found in comparing the rows/columns of the matrix created from all pairwise comparisons?

ADD REPLYlink written 5.7 years ago by Ryan Dale4.9k
gravatar for komal.rathi
5.7 years ago by
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.6k wrote:

You can calculate the Jaccard Statistic, using Bedtools' Jaccard, between multiple bed files, and plot the statistic value in a heatmap.

Example if you run bedtools jaccard statistic between all your samples, you can have a matrix like this:

name            sample1.bed    sample2.bed    sample3.bed
sample1.bed     1               0.0378736    0.044313
sample2.bed     0.0378736       1            0.270536
sample3.bed     0.044313        0.270536     1

which you can then plot as a heatmap.


Aaronquinlan just posted this tutorial regarding Bedtools' Jaccard, great way of using the jaccard statistic as a metric for comparing (and visualizing) several ChIP-seq datasets.

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by komal.rathi3.6k

you can also use 'bedtools jaccard' for this

ADD REPLYlink written 5.7 years ago by Ian5.6k

Ian That's exactly what I have linked to the words 'Jaccard Statistic'.

ADD REPLYlink written 5.7 years ago by komal.rathi3.6k

my mistake!

ADD REPLYlink written 5.7 years ago by Ian5.6k

can you please tell me what would be command for comparing multiple bed files like what you have shown sample2.bed sample2.bed....etc .

Does bedtools jaccard generates these type of matrix?

ADD REPLYlink written 4.9 years ago by Prakash1.9k
gravatar for dariober
5.7 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

Maybe these tools can give you some ideas:

They are all designed to test for spatial correlation between genomic features.

ADD COMMENTlink modified 9 months ago by RamRS27k • written 5.7 years ago by dariober11k
gravatar for bernatgel
5.1 years ago by
Barcelona, Spain
bernatgel2.6k wrote:

The regioneR package in R/Bioconductor can help you with that. It uses a permutation test approach to test the association between two sets of regions. It implements different randomization strategies, all of them taking into account a customizable set of masked regions (unmappable regions, centromeres, repeats...) and different evaluation functions so you can test for overlap, distance, etc...

If your regions are data.frames, GenomicRanges objects or BED-like files in A and B, you can simply use

pt <- overlapPermTest(A=A, B=B, ntimes=100)

and you'll get the association p-value, z-score, etc...

ADD COMMENTlink modified 2.2 years ago • written 5.1 years ago by bernatgel2.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1606 users visited in the last hour