Question: Need Advice On The Correct Usage For The Genome Structure Correction Statistical Software From The Encode Project
6
gravatar for Chris Whelan
7.2 years ago by
Chris Whelan550
Portland, OR
Chris Whelan550 wrote:

I'm wondering if anyone has successfully used the Genome Structure Correction statistical software from the ENCODE project (described in the supplement to the Birney et al. 2007 ENCODE Nature paper, and in Bickel et al. Ann Appl Stat 2011), and could offer some advice.

I've got a list of genomic regions, and I'd like to determine if they are enriched for various genomic features (i.e. segmental duplications, repeats, genes, etc.) in a statistically significant way. Typically I do this by permuting the locations of my regions using shuffleBed from BEDTools many times to establish the null distribution of overlaps between a set of similarly sized regions and the features of interest, and then computing the quantile of the actual observed value. However, I've seen some criticisms of this approach based on the argument that randomly permuting genomic locations ignores local properties of the genome, and I think that GSC is supposed to address this problem.

So, using the block-bootstrap-0.8.1 code at http://www.encodestatistics.org/ , I tried to compute a p-value for overlap between my regions and a feature file.

block_bootstrap.py -1 features.bed -2 my_regions.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B

complement_gaps.bed was created using complementBed on the "gaps" track from UCSC.

I got a low p-value (0.0!), which was encouraging, but I wanted to test it, so I shuffled my regions with shuffleBed:

shuffleBed -i my_regions.bed -g hg19.chrom.sizes > my_regions_shuffled.bed
block_bootstrap.py -1 features.bed -2 my_regions_shuffled.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B

Unfortunately this also gives me a very low p-value, even after repeating it many times. I'd expect shuffled files to generally not have a significant overlap with my features file.

1) Am I using block-bootstrap correctly?

2) My regions are relatively few (~100) and large (~10kb - 200kb). I didn't see anything in the GSC papers and documentation on what type of data works or doesn't work for the test, but it seems like ENCODE mostly is using it on ChIP-SEQ data sets with more, smaller regions of interest. Is this test not appropriate for data like mine?

3) I guessed at how to set the -r parameter, does that make a big difference? Does anyone have any guidelines on how to set parameters for this test? I didn't get much from the papers or documentation.

enrichment encode statistics • 4.9k views
ADD COMMENTlink modified 6.4 years ago by Istvan Albert ♦♦ 81k • written 7.2 years ago by Chris Whelan550
1

you have accidentally included a comma in your encodestatistics link. I think the best way to get these questions answered wou be to email the authors of the paper. It would be appreciated if you posted the response here though so other people seeking the same answers will see it.

ADD REPLYlink written 7.2 years ago by Ying W3.9k

Added some whitespace after the URL to get rid of the comma automatically added to the link.

I did write to the author of the software but didn't hear back. If anyone else makes any progress I'd love to hear about it.

ADD REPLYlink written 7.0 years ago by Chris Whelan550

Hi Chris, when you see a low p-value for your shuffled BED file, did you see a negative or positive Z-score?

ADD REPLYlink written 7.2 years ago by shenli.sam190

I'm getting very negative Z scores for both tests, in the -11 , -12 range.

ADD REPLYlink written 7.0 years ago by Chris Whelan550

Dear Chris, I face the similar problem! So do you know how to use the GSC software now? Thank you!

ADD REPLYlink written 6.5 years ago by pengcui198930

No, I never did figure GSC out. I'm doing location permutations using bedtools shufflebed and also trying to use the GenometriCorr R package to accomplish this task now.

ADD REPLYlink written 6.4 years ago by Chris Whelan550
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 544 users visited in the last hour