How Do You Calculate If Two Sets Of Genomic Regions Overlap Significantly?
3
20
Entering edit mode
13.2 years ago
Stew ★ 1.4k

Given two sets of genomic locations, such as two bed files of ChIP-Seq peaks, how would you calculate if they overlap more than would be expected by chance?

I have been trying a couple of methods.

1) Using the BEDTools shuffleBed tool to randomise one or both of the files repeatedly (say 1000 times) then running intersectBed to determine a distribution of expected overlaps. I then work out an empirical p-value based on the number of the random samples that overlaps to a greater degree than my real overlap.

2) Using the GSC (genome structural correction) tools from Encode.

I have been thinking that method 1 limited to specific regions, rather than whole genome, might be a better approach. I think the assumption that the regions could be anywhere on the genome when shuffled is not accurate. This is something that method two tries to address, but I just can not seem to get it work reliably.

So how should I calculate the significance of my overlaps? My aim is to look at the overlap of peaks between biological replicates of CHiP-Seq and between different peaks callers etc.

chip-seq coverage statistics bed • 19k views
ADD COMMENT
0
Entering edit mode

To anyone arriving to this question, at this other question there's a useful list of tools and some more detailed answers

ADD REPLY
7
Entering edit mode
13.2 years ago
Ian 6.0k

When comparing set A to set B, that are from ChIP-seq data, I have used a method where I obtain random regions to resemble B (the same length) from the mapable region of, in my case, the human genome. This data can be obtained from the test UCSC browser 'CRG mapability'. The overlap between A and B, to A and random can be tested by a Fisher Exact Chi-square. This way random regions cannot come from areas of the genome that could never be binding regions.

I am also looking at alternative methods.

You might be interested in the The Genome HyperBrowser http://hyperbrowser.uio.no/hb/. Where you can ask questions just like this. It offers are range of null hypotheses, some complex like maintaining inter-feature distance, and randomising the feature calls, etc. I quite like the look of this and intend to test it out myself.

David Huen (Cambridge, UK) gave a seminar and discussed this problem. He has an R package called 'cooccur', see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910723/ which does a similar sort of comparison.

ADD COMMENT
0
Entering edit mode

Thanks for the answer, I am trying Cooccur now.

ADD REPLY
0
Entering edit mode

Hi! I am also on my way to use Cooccur package, but I am experiencing some problem like the package works pretty well with the example data. but when I load in my data

>library(Cooccur)
a<-read.table("path")
b<-read.table("path")

everything goes is ok till here.

>explore_pairs(c("a","b"))

This is where I get stuck.
ERROR:

Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) :
  invalid 'type' (list) of argument

Thank you, It will be very kind of you if you could email me on my id abhisheksinghnl@gmail.com

ADD REPLY
0
Entering edit mode

make sure your data have exactly three columns named 'seq' 'start' 'end' in that order

ADD REPLY
5
Entering edit mode
13.2 years ago

This is a trickier problem than it first seems, and there appears to be no consensus yet on the best method to do this. For a summary of most methods available, see the review by Fu and Adryan on this topic: http://www.ncbi.nlm.nih.gov/pubmed/19763325

For interval-based data, I second Ian's suggestion about Huen's method, which follows on from work by Haiminen et al.: http://www.ncbi.nlm.nih.gov/pubmed/18691400.

ADD COMMENT
0
Entering edit mode

That review looks interesting, but unfortunately it is not open access, and we do not have a subscription. Peter Bickel's group at Berkeley did a lot of work on this for ENCODE, that we used on the last CHiP-chip paper, but his methods do not seem to be taking off in the CHiP-Seq world.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

And now in PubMedCentral, in case the personal link above disappears: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3475982/

ADD REPLY
1
Entering edit mode
6.0 years ago
bwd ▴ 90

MSPC is designed specifically for this purpose!

Download it from this page, and use it as the following:

MSPC -i rep1.bed -i rep2.bed -i rep3.bed -r biological -s 1E-8 -w 1E-4 -g 1E-9 -c 2 -a 0.005

where

  • -s sets stringent threshold: peaks in input samples with p-value more significant than this threshold, are considered "stringent" peaks;
  • -w sets weak threshold: peaks in input samples with p-values less significant that this threshold, are considered "weak" peaks;
  • -g how strong should be p-value of overlapping peaks so to consider them "significant";
  • -c a peak should overlap with how many peaks at minimum to be considered for analysis;
  • -a BH multiple testing correction.

See this page for a complete list of arguments.

ADD COMMENT

Login before adding your answer.

Traffic: 2758 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6