I'm wondering if anyone has successfully used the Genome Structure Correction statistical software from the ENCODE project (described in the supplement to the Birney et al. 2007 ENCODE Nature paper, and in Bickel et al. Ann Appl Stat 2011), and could offer some advice.
I've got a list of genomic regions, and I'd like to determine if they are enriched for various genomic features (i.e. segmental duplications, repeats, genes, etc.) in a statistically significant way. Typically I do this by permuting the locations of my regions using shuffleBed from BEDTools many times to establish the null distribution of overlaps between a set of similarly sized regions and the features of interest, and then computing the quantile of the actual observed value. However, I've seen some criticisms of this approach based on the argument that randomly permuting genomic locations ignores local properties of the genome, and I think that GSC is supposed to address this problem.
So, using the block-bootstrap-0.8.1 code at http://www.encodestatistics.org/ , I tried to compute a p-value for overlap between my regions and a feature file.
block_bootstrap.py -1 features.bed -2 my_regions.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B
complement_gaps.bed was created using complementBed on the "gaps" track from UCSC.
I got a low p-value (0.0!), which was encouraging, but I wanted to test it, so I shuffled my regions with shuffleBed:
shuffleBed -i my_regions.bed -g hg19.chrom.sizes > my_regions_shuffled.bed block_bootstrap.py -1 features.bed -2 my_regions_shuffled.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B
Unfortunately this also gives me a very low p-value, even after repeating it many times. I'd expect shuffled files to generally not have a significant overlap with my features file.
1) Am I using block-bootstrap correctly?
2) My regions are relatively few (~100) and large (~10kb - 200kb). I didn't see anything in the GSC papers and documentation on what type of data works or doesn't work for the test, but it seems like ENCODE mostly is using it on ChIP-SEQ data sets with more, smaller regions of interest. Is this test not appropriate for data like mine?
3) I guessed at how to set the -r parameter, does that make a big difference? Does anyone have any guidelines on how to set parameters for this test? I didn't get much from the papers or documentation.