Question: How Do You Calculate If Two Sets Of Genomic Regions Overlap Significantly?
gravatar for Stew
9.6 years ago by
Stew1.4k wrote:

Given two sets of genomic locations, such as two bed files of ChIP-Seq peaks, how would you calculate if they overlap more than would be expected by chance?

I have been trying a couple of methods.

1) Using the BEDTools shuffleBed tool to randomise one or both of the files repeatedly (say 1000 times) then running intersectBed to determine a distribution of expected overlaps. I then work out an empirical p-value based on the number of the random samples that overlaps to a greater degree than my real overlap.

2) Using the GSC (genome structural correction) tools from Encode.

I have been thinking that method 1 limited to specific regions, rather than whole genome, might be a better approach. I think the assumption that the regions could be anywhere on the genome when shuffled is not accurate. This is something that method two tries to address, but I just can not seem to get it work reliably.

So how should I calculate the significance of my overlaps? My aim is to look at the overlap of peaks between biological replicates of CHiP-Seq and between different peaks callers etc.

ADD COMMENTlink modified 2.4 years ago by bwd40 • written 9.6 years ago by Stew1.4k

To anyone arriving to this question, at this other question there's a useful list of tools and some more detailed answers

ADD REPLYlink modified 11 months ago • written 11 months ago by bernatgel2.6k
gravatar for Ian
9.6 years ago by
University of Manchester, UK
Ian5.7k wrote:

When comparing set A to set B, that are from ChIP-seq data, I have used a method where I obtain random regions to resemble B (the same length) from the mapable region of, in my case, the human genome. This data can be obtained from the test UCSC browser 'CRG mapability'. The overlap between A and B, to A and random can be tested by a Fisher Exact Chi-square. This way random regions cannot come from areas of the genome that could never be binding regions.

I am also looking at alternative methods.

You might be interested in the The Genome HyperBrowser Where you can ask questions just like this. It offers are range of null hypotheses, some complex like maintaining inter-feature distance, and randomising the feature calls, etc. I quite like the look of this and intend to test it out myself.

David Huen (Cambridge, UK) gave a seminar and discussed this problem. He has an R package called 'cooccur', see which does a similar sort of comparison.

ADD COMMENTlink written 9.6 years ago by Ian5.7k

Thanks for the answer, I am trying Cooccur now.

ADD REPLYlink written 9.6 years ago by Stew1.4k

Hi! I am also on my way to use Cooccur package, but I am experiencing some problem like the package works pretty well with the example data. but when I load in my data


everything goes is ok till here.


This is where I get stuck.

Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) :
  invalid 'type' (list) of argument

Thank you, It will be very kind of you if you could email me on my id

ADD REPLYlink modified 12 months ago by RamRS30k • written 9.1 years ago by Dataminer2.7k

make sure your data have exactly three columns named 'seq' 'start' 'end' in that order

ADD REPLYlink written 9.1 years ago by Malcolm.Cook1.1k
gravatar for Casey Bergman
9.6 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

This is a trickier problem than it first seems, and there appears to be no consensus yet on the best method to do this. For a summary of most methods available, see the review by Fu and Adryan on this topic:

For interval-based data, I second Ian's suggestion about Huen's method, which follows on from work by Haiminen et al.:

ADD COMMENTlink written 9.6 years ago by Casey Bergman18k

That review looks interesting, but unfortunately it is not open access, and we do not have a subscription. Peter Bickel's group at Berkeley did a lot of work on this for ENCODE, that we used on the last CHiP-chip paper, but his methods do not seem to be taking off in the CHiP-Seq world.

ADD REPLYlink written 9.6 years ago by Stew1.4k

That review:

ADD REPLYlink written 9.1 years ago by Malcolm.Cook1.1k

And now in PubMedCentral, in case the personal link above disappears:

ADD REPLYlink written 7.6 years ago by David Quigley11k
gravatar for bwd
2.4 years ago by
bwd40 wrote:

MSPC is designed specifically for this purpose!

Download it from this page, and use it as the following:

MSPC -i rep1.bed -i rep2.bed -i rep3.bed -r biological -s 1E-8 -w 1E-4 -g 1E-9 -c 2 -a 0.005


  • -s sets stringent threshold: peaks in input samples with p-value more significant than this threshold, are considered "stringent" peaks;
  • -w sets weak threshold: peaks in input samples with p-values less significant that this threshold, are considered "weak" peaks;
  • -g how strong should be p-value of overlapping peaks so to consider them "significant";
  • -c a peak should overlap with how many peaks at minimum to be considered for analysis;
  • -a BH multiple testing correction.

See this page for a complete list of arguments.

ADD COMMENTlink written 2.4 years ago by bwd40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 695 users visited in the last hour