Question: How Do You Calculate If Two Sets Of Genomic Regions Overlap Significantly?
19
gravatar for Stew
8.2 years ago by
Stew1.4k
Cambridge
Stew1.4k wrote:

Given two sets of genomic locations, such as two bed files of ChIP-Seq peaks, how would you calculate if they overlap more than would be expected by chance?

I have been trying a couple of methods.

1) Using the BEDTools shuffleBed tool to randomise one or both of the files repeatedly (say 1000 times) then running intersectBed to determine a distribution of expected overlaps. I then work out an empirical p-value based on the number of the random samples that overlaps to a greater degree than my real overlap.

2) Using the GSC (genome structural correction) tools from Encode.

I have been thinking that method 1 limited to specific regions, rather than whole genome, might be a better approach. I think the assumption that the regions could be anywhere on the genome when shuffled is not accurate. This is something that method two tries to address, but I just can not seem to get it work reliably.

So how should I calculate the significance of my overlaps? My aim is to look at the overlap of peaks between biological replicates of CHiP-Seq and between different peaks callers etc.

ADD COMMENTlink modified 11 months ago by Hmd30 • written 8.2 years ago by Stew1.4k
7
gravatar for Ian
8.2 years ago by
Ian5.4k
University of Manchester, UK
Ian5.4k wrote:

When comparing set A to set B, that are from ChIP-seq data, I have used a method where I obtain random regions to resemble B (the same length) from the mapable region of, in my case, the human genome. This data can be obtained from the test UCSC browser 'CRG mapability'. The overlap between A and B, to A and random can be tested by a Fisher Exact Chi-square. This way random regions cannot come from areas of the genome that could never be binding regions.

I am also looking at alternative methods.

You might be interested in the The Genome HyperBrowser http://hyperbrowser.uio.no/hb/. Where you can ask questions just like this. It offers are range of null hypotheses, some complex like maintaining inter-feature distance, and randomising the feature calls, etc. I quite like the look of this and intend to test it out myself.

David Huen (Cambridge, UK) gave a seminar and discussed this problem. He has an R package called 'cooccur', see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910723/ which does a similar sort of comparison.

ADD COMMENTlink written 8.2 years ago by Ian5.4k

Thanks for the answer, I am trying Cooccur now.

ADD REPLYlink written 8.2 years ago by Stew1.4k

Hi! I am also on my way to use Cooccur package, but I am experiencing som problem like the package works pretty well with the example data. but when I load in my data

library(Cooccur) a<-read.table("path") b<-read.table("path") (it works fine till here, but on next step it gives error) explore_pairs(c("a","b")) ERROR: Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) : invalid 'type' (list) of argument could you please tell me what is going wrong. Thank you for your time and your article on 21 TF in drosophila is quite informative.abhisheksinghnl@gmail.com

ADD REPLYlink written 7.7 years ago by Dataminer2.6k

Hi! [?] I am also on my way to use Cooccur package, but I am experiencing som problem like the package works pretty well with the example data. but when I load in my data

a<-read.table("path") [?] b<-read.table("path") (it works fine till here, but on next step it gives error) [?] explore_pairs(c("a","b")) [?] ERROR: Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) : invalid 'type' (list) of argument could you please tell me what is going wrong. Thank you for your time and your article on 21 TF in drosophila is quite informative.abhisheksinghnl@gmail.com

ADD REPLYlink written 7.7 years ago by Dataminer2.6k

Hi! I am also on my way to use Cooccur package, but I am experiencing som problem like the package works pretty well with the example data. but when I load in my data

library(Cooccur) a<-read.table("path") b<-read.table("path")

everything goes is ok till here.

explore_pairs(c("a","b"))

This is where I get stuck. ERROR: Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) : invalid 'type' (list) of argument Thank you, It will be very kind of you if you could email me on my id abhisheksinghnl@gmail.com

ADD REPLYlink written 7.7 years ago by Dataminer2.6k

make sure your data have exactly three columns named 'seq' 'start' 'end' in that order

ADD REPLYlink written 7.7 years ago by Malcolm.Cook1.0k
5
gravatar for Casey Bergman
8.2 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

This is a trickier problem than it first seems, and there appears to be no consensus yet on the best method to do this. For a summary of most methods available, see the review by Fu and Adryan on this topic: http://www.ncbi.nlm.nih.gov/pubmed/19763325

For interval-based data, I second Ian's suggestion about Huen's method, which follows on from work by Haiminen et al.: http://www.ncbi.nlm.nih.gov/pubmed/18691400.

ADD COMMENTlink written 8.2 years ago by Casey Bergman18k

That review looks interesting, but unfortunately it is not open access, and we do not have a subscription. Peter Bickel's group at Berkeley did a lot of work on this for ENCODE, that we used on the last CHiP-chip paper, but his methods do not seem to be taking off in the CHiP-Seq world.

ADD REPLYlink written 8.2 years ago by Stew1.4k

That review: http://people.pwf.cam.ac.uk/qf205/Papers/FuAdryan2009.pdf

ADD REPLYlink written 7.7 years ago by Malcolm.Cook1.0k

And now in PubMedCentral, in case the personal link above disappears: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3475982/

ADD REPLYlink written 6.1 years ago by David Quigley11k
1
gravatar for Hmd
11 months ago by
Hmd30
Hmd30 wrote:

MSPC is designed specifically for this purpose!

Download it from this page, and use it as the following:

MSPC -i rep1.bed -i rep2.bed -i rep3.bed -r biological -s 1E-8 -w 1E-4 -g 1E-9 -c 2 -a 0.005

where

  • -s sets stringent threshold: peaks in input samples with p-value more significant than this threshold, are considered "stringent" peaks;
  • -w sets weak threshold: peaks in input samples with p-values less significant that this threshold, are considered "weak" peaks;
  • -g how strong should be p-value of overlapping peaks so to consider them "significant";
  • -c a peak should overlap with how many peaks at minimum to be considered for analysis;
  • -a BH multiple testing correction.

See this page for a complete list of arguments.

ADD COMMENTlink written 11 months ago by Hmd30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1645 users visited in the last hour