Question: General Considerations For Genomic Overlaps?
5.0 years ago, plfalcon81 wrote:

Hello, I was wondering about general considerations for overlapping genomic regions and computing Monte Carlo-type statistics on the result.

Below I describe how I do it. Unfortunately, I'm not fully confident that this is correct, so I'd appreciate any thoughts on it.

E.g. I have an experimental dataset (A) of 10 bp intervals, with approx. 5,000 entries spread across the genome.

Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp intervals, with ~50,000 entries spread across the genome.

If I intersect A and B with BEDTools, I get my overlap, e.g. 2,000 entries from A.

But I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend the peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the fraction of the genome that is searched becomes larger, and some extended entries may now overlap each other.

So I intersect A and B again, counting each entry in A only once. This gives me e.g. 3,000 entries from A.
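A minimal Python sketch of this extend-and-count step, to make the counting rule explicit. It assumes a single chromosome and half-open (start, end) coordinates; the function names and the `flank` and `chrom_len` parameters are made up for illustration. It is roughly what `bedtools slop` followed by `bedtools intersect -u` would give, not a replacement for them.

```python
from bisect import bisect_left


def extend_and_merge(intervals, flank, chrom_len):
    """Extend each (start, end) by `flank` bp on both sides, clip to the
    chromosome, and merge intervals that overlap or touch after extension."""
    extended = sorted(
        (max(0, start - flank), min(chrom_len, end + flank))
        for start, end in intervals
    )
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: grow it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_a_hits(a_intervals, merged_b):
    """Count A intervals overlapping at least one merged B interval.
    Each A interval is counted once, however many B intervals it touches."""
    starts = [start for start, _ in merged_b]
    hits = 0
    for a_start, a_end in a_intervals:
        # Index of the last B interval that starts before this A interval ends.
        i = bisect_left(starts, a_end) - 1
        if i >= 0 and merged_b[i][1] > a_start:
            hits += 1
    return hits


# Toy data (hypothetical): three 10 bp A intervals, two 100 bp B peaks.
a = [(100, 110), (500, 510), (2000, 2010)]
b = [(150, 250), (1500, 1600)]
for flank in (0, 100, 400, 500):
    merged_b = extend_and_merge(b, flank, chrom_len=10_000)
    print(flank, count_a_hits(a, merged_b))
```

Merging after extension does not change the count of A entries, but it makes the searched fraction of the genome easy to compute as the summed length of the merged intervals.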

For the simulations, I use random intervals that look like dataset B. E.g. I pick 50,000 random 1,000 bp intervals, intersect them with A, and repeat this 1,000 times. I then get e.g. an average of 500 entries from A.

For overlaps in the vicinity, I calculate the total size of the extended dataset B and generate random intervals of the same length and the same total size in bp (size-matched sampling).
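The simulation described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: one chromosome, uniform placement with no excluded regions (gaps, unmappable sequence), and random intervals that are allowed to overlap each other, so it matches the interval lengths of B but not necessarily the merged bp total. All names (`count_a_hits`, `random_like`, `overlap_test`) are hypothetical.

```python
import random
from bisect import bisect_left


def count_a_hits(a_intervals, b_intervals):
    """Count A intervals overlapping at least one B interval (half-open)."""
    starts, running_max_end = [], []
    max_end = 0
    for start, end in sorted(b_intervals):
        max_end = max(max_end, end)
        starts.append(start)
        running_max_end.append(max_end)
    hits = 0
    for a_start, a_end in a_intervals:
        # Last B interval that starts before this A interval ends.
        i = bisect_left(starts, a_end) - 1
        # Some B starting before a_end also ends after a_start -> overlap.
        if i >= 0 and running_max_end[i] > a_start:
            hits += 1
    return hits


def random_like(b_intervals, chrom_len, rng):
    """One random interval per B interval, same length, uniform position."""
    randomized = []
    for start, end in b_intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_len - length + 1)
        randomized.append((new_start, new_start + length))
    return randomized


def overlap_test(a_intervals, b_intervals, chrom_len, n_sims=1000, seed=0):
    """Return (observed hits, mean simulated hits, one-sided empirical p)."""
    rng = random.Random(seed)
    observed = count_a_hits(a_intervals, b_intervals)
    null = [
        count_a_hits(a_intervals, random_like(b_intervals, chrom_len, rng))
        for _ in range(n_sims)
    ]
    n_extreme = sum(1 for hits in null if hits >= observed)
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (n_extreme + 1) / (n_sims + 1)
    return observed, sum(null) / n_sims, p_value


# Toy data (hypothetical): every A interval sits inside a B peak.
b = [(i * 1000, i * 1000 + 100) for i in range(1, 21)]
a = [(start + 50, start + 60) for start, _ in b]
print(overlap_test(a, b, chrom_len=100_000, n_sims=200))
```

Passing the extended B intervals instead of the original peaks gives the size-matched version, since each random interval inherits the extended length.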

I hope you can follow this way of thinking.

So the question basically is: is this correct? And how far can I extend my intervals before the overlap becomes artificial? At the largest extension, dataset B covers ~15% of the genome, and this captures almost all entries from A, far more than in the 1,000 simulations.

Any thoughts are appreciated, e.g. would it be better to turn it around and enlarge the entries in A instead?

5.0 years ago, dariober (WCIP | Glasgow | UK) wrote:

If I understand correctly, you want to know whether the A intervals are spatially related to the B intervals, right?

Instead of extending the A intervals, I would assign to each A interval its closest B interval and use this distance to compare the real data with the randomizations.
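A rough Python sketch of that idea, assuming a single chromosome, half-open (start, end) coordinates, a non-empty B, and the median distance as the summary statistic. Those choices, and the names `nearest_distances` and `distance_test`, are assumptions for illustration rather than anything prescribed here.

```python
import random
import statistics
from bisect import bisect_left


def nearest_distances(a_intervals, b_intervals):
    """For each A interval, the gap in bp to the closest B interval
    (0 if they overlap). Requires at least one B interval."""
    starts, running_max_end = [], []
    max_end = 0
    for start, end in sorted(b_intervals):
        max_end = max(max_end, end)
        starts.append(start)
        running_max_end.append(max_end)
    distances = []
    for a_start, a_end in a_intervals:
        j = bisect_left(starts, a_end)  # first B starting at or after a_end
        candidates = []
        if j > 0:
            # B intervals starting before a_end: overlap, or gap upstream.
            candidates.append(max(0, a_start - running_max_end[j - 1]))
        if j < len(starts):
            candidates.append(starts[j] - a_end)  # gap to next B downstream
        distances.append(min(candidates))
    return distances


def distance_test(a_intervals, b_intervals, chrom_len, n_sims=1000, seed=0):
    """Compare the observed median A-to-B distance with length-matched
    randomizations of B. Returns (observed median, empirical p-value)."""
    rng = random.Random(seed)
    observed = statistics.median(nearest_distances(a_intervals, b_intervals))
    n_closer = 0
    for _ in range(n_sims):
        randomized = []
        for start, end in b_intervals:
            length = end - start
            new_start = rng.randrange(0, chrom_len - length + 1)
            randomized.append((new_start, new_start + length))
        simulated = statistics.median(nearest_distances(a_intervals, randomized))
        if simulated <= observed:
            n_closer += 1
    # One-sided: how often is random B at least as close to A as the real B?
    return observed, (n_closer + 1) / (n_sims + 1)


# Toy data (hypothetical): each A interval lies 20 bp downstream of a B peak.
b = [(i * 5000, i * 5000 + 1000) for i in range(1, 11)]
a = [(end + 20, end + 30) for _, end in b]
print(distance_test(a, b, chrom_len=1_000_000, n_sims=200))
```

Working with distances avoids having to choose an extension size at all: the whole distribution of distances can be compared between the real data and the randomizations.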

I think these ideas have been implemented in these packages:


GAT: a simulation framework for testing the association of genomic intervals
