Appropriate statistical test for overlap analyses?
4.5 years ago
mmmmcandrew ▴ 130

Hi all-

This is somewhat similar to some older Biostars posts, but most of those are quite old and I'm wondering if there has been some consensus since.

I have been doing a number of analyses looking at overlap between regions of interest, and I would now like to apply a statistical test to determine if the overlap between these regions is significant. I am thinking that I can compare to overlap with random size-matched regions to generate a contingency table like this:

                 Regions of interest                  Random regions
Overlapping                value 1                         value 2
Non-overlapping            value 3                         value 4


Then run a chi-square test. Does this seem reasonable to everyone else? Or will I get dinged by reviewers?
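For reference, the test proposed above can be sketched in a few lines of plain Python. This is only an illustration: the four counts are hypothetical placeholders for value 1 to value 4, and the function name `chi_square_2x2` is made up for this example. It computes the Pearson chi-square statistic without continuity correction and uses the fact that, for 1 degree of freedom, the p-value equals erfc(sqrt(statistic / 2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts, laid out like the table above:
#                  regions of interest   random regions
# overlapping              120                 45
# non-overlapping          880                955
stat, p = chi_square_2x2(120, 45, 880, 955)
print(f"chi-square = {stat:.3f}, p = {p:.3g}")
```

With small expected counts in any cell, Fisher's exact test is usually the safer choice than the chi-square approximation.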

statistics overlap bedtools chi square • 3.9k views

Bedtools has three commands to test for statistical relationships (jaccard, reldist and fisher) between feature files.
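To give an idea of what the Jaccard statistic measures, here is a rough pure-Python sketch: base pairs in the intersection of two interval sets divided by base pairs in their union. It is not bedtools' implementation; it assumes half-open (start, end) intervals on a single chromosome, and the helper names and example coordinates are made up for illustration.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bp(intervals):
    """Number of base pairs covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))

def jaccard(a, b):
    """Base pairs in the intersection divided by base pairs in the union."""
    bp_union = covered_bp(list(a) + list(b))
    bp_intersection = covered_bp(a) + covered_bp(b) - bp_union
    return bp_intersection / bp_union if bp_union else 0.0

# Hypothetical interval sets on one chromosome.
set_a = [(100, 200), (400, 500)]
set_b = [(150, 250), (600, 700)]
print(jaccard(set_a, set_b))
```

A value of 0 means no shared bases and 1 means the two sets cover exactly the same bases. The statistic by itself is not a p-value.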

4.5 years ago

I have just thought about this setting a bit, and indeed there have been some questions asking about "significant overlap" or similar. It might be surprising, but the result of my little analysis here is that the model of "significant interval overlap" is not useful for modelling any biological problem on a genome-wide scale; this is mainly because it does not take into account, or correctly model, how the intervals are generated.

Remark: If short intervals are distributed independently and uniformly on a genome, any single overlap between two small intervals is already significant, and any number of overlaps >1 even more so. For larger numbers of intervals, we can correct for multiple testing (Lander-Waterman). To assess significance, we need to calculate the probability of an event occurring just by chance. The probability P that two intervals overlap depends on the size of the genome and of the intervals; for small numbers of intervals it is P ~ 2s/G, where s is the interval size and G is the genome size.
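To put numbers on the P ~ 2s/G approximation, here is a small sketch. The interval size, genome size and interval counts are hypothetical, and the expected-pairs helper is my own extension of the remark: it multiplies the pairwise probability by the number of pairs, which holds under the same independent uniform model.

```python
def pair_overlap_probability(s, G):
    """Approximate chance that two intervals of length s, placed uniformly and
    independently on a genome of length G, overlap (valid when s << G)."""
    return 2 * s / G

def expected_overlapping_pairs(n_a, n_b, s, G):
    """Expected number of overlapping pairs between two sets of n_a and n_b
    intervals under the same independent uniform model."""
    return n_a * n_b * pair_overlap_probability(s, G)

s = 1_000            # hypothetical interval size in bp
G = 3_000_000_000    # hypothetical genome size in bp (roughly human-sized)

p_single = pair_overlap_probability(s, G)
expected = expected_overlapping_pairs(1_000, 1_000, s, G)
print(f"P(single pair overlaps) ~ {p_single:.2e}")
print(f"expected overlapping pairs for 1000 x 1000 intervals ~ {expected:.2f}")
```

Even with a thousand intervals in each set, fewer than one overlapping pair is expected by chance under this model.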

Now, the problem is that the intervals are generated by a process that is neither independent nor uniform. You did not say which process generates your intervals, but it seems to be a manually curated one. Your contingency table (the example is made up) would not help you in this case, because the entry for randomly overlapping intervals will always be 0 or close to 0 unless you reach large coverage.
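A quick simulation illustrates the last point. This is a sketch under stated assumptions, not an analysis of real data: both interval sets are drawn uniformly at random on a single hypothetical 100 Mb sequence, all intervals are 1 kb, and the function names are invented for the example.

```python
import random

def random_intervals(n, size, genome_size, rng):
    """Draw n half-open intervals of fixed size with uniform random starts."""
    starts = (rng.randrange(genome_size - size) for _ in range(n))
    return [(start, start + size) for start in starts]

def count_overlapping(query, reference):
    """Number of query intervals that overlap at least one reference interval."""
    return sum(
        any(q_start < r_end and r_start < q_end for r_start, r_end in reference)
        for q_start, q_end in query
    )

rng = random.Random(0)        # fixed seed so the run is repeatable
genome_size = 100_000_000     # hypothetical sequence length in bp
regions = random_intervals(200, 1_000, genome_size, rng)
shuffled = random_intervals(200, 1_000, genome_size, rng)

n_hits = count_overlapping(regions, shuffled)
print(n_hits, "of", len(regions), "regions overlap a random region")
```

At this coverage (200 kb out of 100 Mb) the "random, overlapping" cell stays at or near zero, so the contingency table carries almost no information about the null.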