Question

Association between bed files - statistical significance

6

Entering edit mode

9.5 years ago

justin.nelligan ▴ 70

I have 2 sets of bed files from ChIP-seq, A and B, and a certain number of peaks overlap. Now I want to put some kind of enrichment score / p-value on that overlap - if only 2 peaks overlap and A and B contain both 10 000 peaks, it's not really significant, while if 90 peaks overlap, A is 10000 peaks and B is 100, it's more significant.

The classical way to do this would be to do a hypergeometric test - all the variables of the test are easily filled in (number of draws, number of successes, number of 'failures') except one, the total population number. In this scenario it would be something similar to 'the total number of peaks one could potentially draw from the genome', which is impossible to estimate, and could be biased (regions that don't chip well, unmappable regions, etc).

What's the best way to put a 'score' on an overlap between A and B then? The ultimate goal is to display some kind of heatmap, as I'm doing the comparison between many A's and many B's.

ChIP-Seq • 9.0k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 9.5 years ago by justin.nelligan ▴ 70

9

Entering edit mode

9.5 years ago

komal.rathi ★ 4.1k

You can calculate the Jaccard Statistic, using Bedtools Jaccard, between multiple bed files, and plot the statistic value in a heatmap.

Example if you run bedtools jaccard statistic between all your samples, you can have a matrix like this:

name            sample1.bed    sample2.bed    sample3.bed
sample1.bed     1               0.0378736    0.044313
sample2.bed     0.0378736       1            0.270536
sample3.bed     0.044313        0.270536     1

which you can then plot as a heatmap.

Update:

Aaronquinlan just posted this tutorial regarding Bedtools Jaccard, great way of using the jaccard statistic as a metric for comparing (and visualizing) several ChIP-seq datasets.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

you can also use 'bedtools jaccard' for this

ADD REPLY • link 9.5 years ago by Ian 6.0k

0

Entering edit mode

Ian That's exactly what I have linked to the words 'Jaccard Statistic'.

ADD REPLY • link 9.5 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

my mistake!

ADD REPLY • link 9.5 years ago by Ian 6.0k

0

Entering edit mode

Can you please tell me what would be command for comparing multiple bed files like what you have shown sample2.bed sample2.bed....etc.

Does bedtools jaccard generates these type of matrix?

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by Prakash ★ 2.2k

6

Entering edit mode

9.5 years ago

dariober 14k

Maybe these tools can give you some ideas:

They are all designed to test for spatial correlation between genomic features.

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 9.5 years ago by dariober 14k

4

Entering edit mode

8.9 years ago

bernatgel ★ 3.4k

The regioneR package in R/Bioconductor can help you with that. It uses a permutation test approach to test the association between two sets of regions. It implements different randomization strategies, all of them taking into account a customizable set of masked regions (unmappable regions, centromeres, repeats...) and different evaluation functions so you can test for overlap, distance, etc...

If your regions are data.frames, GenomicRanges objects or BED-like files in A and B, you can simply use

library(regioneR)
pt <- overlapPermTest(A=A, B=B, ntimes=100)
plot(pt)

and you'll get the association p-value, z-score, etc...

ADD COMMENT • link 6.0 years ago by bernatgel ★ 3.4k

Ram · Accepted Answer · 2014-11-18

11

Entering edit mode

9.5 years ago

brentp 24k

Something to this effect is available in the latest release of bedtools as bedtools fisher.

It uses bases, rather than counts of peaks. Currently, it doesn't allow you to exclude regions like centromeres or "quiescient" regions that are unlikely to harbor overlaps and I have a PR with that as a work in progress, but it won't be ready for a while.

I have a ticket open to allow the use of counts of intervals, but it will suffer from determining the denominator as you suggest. I think that the total number of possible intervals could be the number of times an interval of the average size could fit into the genome.

For your stuff, you may want to check out this paper and the associated software.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by brentp 24k

1

Entering edit mode

Love the idea of the per-base Fisher test, I'll probably give that a try. I'm already using Intervalstats to calculate the per-peak p-value between A and B - my problem is that in their paper they suggest a summary statistic being 'number of significant peaks in A versus B' divided by 'total number of peaks in A'. All of that is fine in the case where you can compare one set of bed files against themselves, but I'm doing a pairwise comparison between all beds in 'set of bed A' versus 'set of bed B', and depending on the order (A vs B or B vs A) you can have wildly different summary statistics. One example would be if you have 50 significant peaks between A and B, but A has 5000 peaks while B has 100 - the summary statistic changes from 0.01 to 0.5 depending on the order. I was trying to find a way to quantify the relationship between the two in a single value.

ADD REPLY • link 9.5 years ago by justin.nelligan ▴ 70

1

Entering edit mode

The asymmetry in the IntervalStats algorithm can be useful for biological interpretation. In the example you give, you might imagine A to be something like PolII (found at many places) and B to be an activating transcription factor (found at fewer places but typically with A). Maybe the single metric you're looking for can be found in comparing the rows/columns of the matrix created from all pairwise comparisons?

ADD REPLY • link 9.5 years ago by Ryan Dale 5.0k