Question: Probability of overlap between sequences?
gravatar for Rich
4.1 years ago by
Rich30 wrote:


I'm sorry if this question is too naive but maybe you can advise me the right way to solve a problem.

I have a 50 kb region on chromosome 6 (length 171,115,067 bp). It turned out to be, that this region completely overlaps with one specific site (5 kb). However, there are about 250 such specific sites across the entire chr 6.

How can I calculate chances that such overlap between my 50 kb region and one of such 5 kb sites happens just randomly? Is there any simple formula? Or I need to generate randomly 10 or 100 thousands of 50 kb regions from chr 6 and then calculate how many times such randomly generated regions overlap with the 5 kb sites?

And if the later way is the right one is there any tool that could generate such sequences from the given human chromosome?

Thank you!

sequencing probability • 1.5k views
ADD COMMENTlink modified 4.1 years ago by dariober9.9k • written 4.1 years ago by Rich30

Random sampling would be a good way to generate your "background" or expected rate of overlap. You might be careful how you generate your sample space, in that you might exclude sampling from certain areas (unmappable regions, say).

ADD REPLYlink written 4.1 years ago by Alex Reynolds27k

Thank you, Alex! Do you know any tool (maybe in BioPerl or etc.) that could generate it based on real human chromosomes?

ADD REPLYlink written 4.1 years ago by Rich30
gravatar for Alex Reynolds
4.1 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

To get bounds for, say, hg19, you could do something like:

$ wget -qO- | gunzip -c - | awk '{print $1"\t0\t"$2}' - | sort-bed - > hg19.bounds.bed

You could use BEDOPS to subtract out unmapped regions:

$ wget -qO- > hg19.gaps.bed
$ bedops --difference hg19.bounds.bed hg19.gaps.bed > hg19.mappable.bed

A sampler that goes through hg19.mappable.bed could be written in a scripting language of your choice. I don't think it would be too hard to sample uniform (or other distribution) starting points within the given, mappable genomic spaces, adding 50K to that point to generate a random interval. You could do read.table() in R, for instance, to bring in the mappable spaces to do sampling there. There are lots of options — too many to enumerate here.

Your script determines how many random intervals you sampled (10K, 100K, whatever). Spit all these random intervals out to a sorted BED file, and then do something like bedops --element-of 1 random_intervals.bed 5kb_sites.bed | wc -l to get counts of overlapping random intervals.

Divide the two numbers to get an expected rate of overlap. Repeat this to build a population of expected or background rates to compare against your observed rate. You might generate a z-score, for example, to indicate how near or far away your observed rate is from what is expected.

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Alex Reynolds27k

Thank you, Alex! I generated starting positions using 'shuf' command in Unix and and then used BEDOPS. Didn't know about this software before. Thank you for telling about it.

ADD REPLYlink written 4.1 years ago by Rich30
gravatar for dariober
4.1 years ago by
WCIP | Glasgow | UK
dariober9.9k wrote:

Hi- Have a look at these Q&A:

Сalculating fold-enrichment of ChIP-seq peaks intersecting with promoters (vs. genome average)

Association between bed files - statistical significance

And this paper The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment.



ADD COMMENTlink written 4.1 years ago by dariober9.9k

Thank you for the helpful links!

ADD REPLYlink written 4.1 years ago by Rich30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1861 users visited in the last hour