Probability of overlap between sequences?
2
0
Entering edit mode
9.7 years ago
Rich ▴ 40

Hi,

I'm sorry if this question is too naive but maybe you can advise me the right way to solve a problem.

I have a 50 kb region on chromosome 6 (length 171,115,067 bp). It turned out to be, that this region completely overlaps with one specific site (5 kb). However, there are about 250 such specific sites across the entire chr 6.

How can I calculate chances that such overlap between my 50 kb region and one of such 5 kb sites happens just randomly? Is there any simple formula? Or I need to generate randomly 10 or 100 thousands of 50 kb regions from chr 6 and then calculate how many times such randomly generated regions overlap with the 5 kb sites?

And if the later way is the right one is there any tool that could generate such sequences from the given human chromosome?

Thank you!

sequencing probability • 3.2k views
ADD COMMENT
0
Entering edit mode

Random sampling would be a good way to generate your "background" or expected rate of overlap. You might be careful how you generate your sample space, in that you might exclude sampling from certain areas (unmappable regions, say).

ADD REPLY
0
Entering edit mode

Thank you, Alex! Do you know any tool (maybe in BioPerl or etc.) that could generate it based on real human chromosomes?

ADD REPLY
3
Entering edit mode
9.7 years ago

To get bounds for, say, hg19, you could do something like:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/chromInfo.txt.gz | gunzip -c - | awk '{print $1"\t0\t"$2}' - | sort-bed - > hg19.bounds.bed

You could use BEDOPS to subtract out unmapped regions:

$ wget -qO- https://gist.githubusercontent.com/alexpreynolds/6126976/raw/4bfdd68c0cc45c4c1b0643ea16c8630364d2f316/hg19.gaps.bed > hg19.gaps.bed
$ bedops --difference hg19.bounds.bed hg19.gaps.bed > hg19.mappable.bed

A sampler that goes through hg19.mappable.bed could be written in a scripting language of your choice. I don't think it would be too hard to sample uniform (or other distribution) starting points within the given, mappable genomic spaces, adding 50K to that point to generate a random interval. You could do read.table() in R, for instance, to bring in the mappable spaces to do sampling there. There are lots of options - too many to enumerate here.

Your script determines how many random intervals you sampled (10K, 100K, whatever). Spit all these random intervals out to a sorted BED file, and then do something like bedops --element-of 1 random_intervals.bed 5kb_sites.bed | wc -lto get counts of overlapping random intervals.

Divide the two numbers to get an expected rate of overlap. Repeat this to build a population of expected or background rates to compare against your observed rate. You might generate a z-score, for example, to indicate how near or far away your observed rate is from what is expected.

ADD COMMENT
1
Entering edit mode

Thank you, Alex! I generated starting positions using 'shuf' command in Unix and and then used BEDOPS. Didn't know about this software before. Thank you for telling about it.

ADD REPLY
0
0
Entering edit mode

Thank you for the helpful links!

ADD REPLY

Login before adding your answer.

Traffic: 1595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6