Question: Statistical Measure Of Biological Significance For Overlapping Genomic Regions
7
Dataminer2.6k wrote:

Hi!

Imagine, if you were given two datasets A and B, each consisting of genomic regions. And you were also given following information about each dataset.

Dataset A contains 83 genomic regions of interest. Dataset B contains 63 genomic regions of interest.

An overlap of Dataset A to Dataset B gives 36 genomic regions.

Now, I am interested in measuring the significance for this pairwise overlap, which is 36.

How can I measure this? any suggestions.

overlap statistics • 9.1k views
modified 6 weeks ago by rishi10 • written 8.2 years ago by Dataminer2.6k
1

See related post: http://biostar.stackexchange.com/questions/5501/how-do-you-calculate-if-two-sets-of-genomic-regions-overlap-significantly

12
brentp23k wrote:

If all you care about is the overlap and you're not worried about the size of the regions or anything about the characteristics of the rest of the genome, you can use the hypergeometric distribution. See this paper and an implementation here. If your 83 and 63 are drawn from a list of 25K possible genes, that would look like this:

``````./list_overlap_p.py 36 25000 83 63
1.42574840822e-11
``````

so the probability of those 2 gene sets sharing 36 genes by chance is quite small.

If you have information about the starts and stops in your (2) dataset(s), you can have a look at shuffleBed in BedTools and see this thread on the mailing list for some ideas.

Also, checkout `randomstats` in pybedtools by Ryan Dale. I haven't used this, but see the reference in that docstring.

1

Thank you :) !!!

1

@Stefano exactly, that's why there are 4 parameters. @SNPminer how many proteins are there for your organism? Or, if you're only dealing with a specific region of interest, how many total proteins from that region of interest? As @Stefano says, this will greatly affect result.

Note: you need to know the total size of the "population" where these datasets are coming from. if you had 100 in total, then I suspect 36 is not so impressive. 25000 instead...

Confusion, since I am not dealing with proteins and I do not know what to give in place of second value (as you have given 25000). I only have three values which I stated in my question. What shall I put in second column ????????????

well these are peaks called by a peak calling algorithm.

@SNPminer, in that case, it's probably safe to use the total number of proteins/genes for your organism.

Hi brentp, I am also working on a similar problem where I have a certain number of genes in the first dataset and a certain number of genes in the second dataset. I am interested in calculating the hypergeometric probability of the observed number of overlap between the two datasets. I have a population where these genes come from. I find Brent's python script very useful for it.

I am not clear on what the following notations mean in the context of my dataset: Could you please help? N: The number of items in the population. k: The number of items in the population that are classified as successes. n: The number of items in the sample. x: The number of items in the sample that are classified as successes. kCx: The number of combinations of k things, taken x at a time. h(x; N, n, k): hypergeometric probability - the probability that an n-trial hypergeometric experiment results in exactly x successes, when the population consists of N items, k of which are classified as successes.

Thanks very much!

1

if you tell what your data is, it will be easier to help you. normally, you have 2 sets of genes A, B. those may be differentially expressed genes or something. you want to know if they overlapped by chance. Those are drawn from a pool of 30,000 probes, that would be N. your k would be the number of genes shared between A and B.

Thanks so much! This is what I have figured out. Please correct me if I am wrong. Your assumptions about the data are correct.

N is the population size (in my data the total number of probes, ex 30000)

K is the number of success states in the population (in my data the number of genes in the first dataset)

n is the number of draws (in my data the number of genes in the second dataset)

k is the number of successes( in my data the number of shared genes between the two datasets)

Also, when the ran your python script for the same values, the result was different. Why would that be? Thanks!! ./hypergeometric.py 36 25000 83 63 7.19574400065e-11

Hi brentp, I have a few questions regarding the overlap.py script. 1. Is it possible for the script to result in a negative value? I got a negative value for one of my data calculations and it doesn't make sense for a probability to be negative. The data was : .list_overlap.py 5059204 6041633 5385248 5325904

1. Is the function phyper in R the equivalent of this script? If I did 1-phyper (36-1, 83, 25000-83, 63), is this the same as ./list_overlap_p.py 36 25000 83 63 ? I am asking because I am getting different results from those.

Thank you very much for your help!

Dear All,

I have a similar question, I am having some CNV data which I obtained from my exome data after running CNV algo on them. Now for my tumor and its reprogrammed cells I have found CNVs, I am having 97 CNV for my tumor and 223 CNVs for my clone and they overlap for 18 regions, this numbers are chr1:start-end coordinates, now if I want to calculate the significance of the overlap here what should I place in place of N, should it be the number of total total coordinates in a genome? in your example its genes where roughly the human genes characterized so far is 25000 so any gene overlap and its significance can be calculated based on that but can you give me idea about what should be the population in my case. How should I be able to calculate that? And then once I have the population I can consider the python scripts to calculate my overlap significance.

I would like to ask I found from this link http://www.nimblegen.com/downloads/annotation/hg19/ a file containing the potential CNVs that have been curated till date for hg19 from 20 CEU and 20 YRI HapMap research samples using a set of NimbleGen CGH arrays that contains approximately 42 million probes tiled across the genome. It contains roughly 8300 CNVs, so for my above question can I use the population of CNV as this and then calculate the significance of the overlap? This should be valid for my CNVs as well though they are from tumor and clones. Kindly give some suggestion.

Hi Brent (brentp) or Dataminer,

I need to perform a similar analysis. I have been trying to install scipy on my new server but it has almost been two hours without any success. I have an older version of scipy but that returns "nan" for the numbers below. Can anyone of you run the list_overlap_p.py script for me? Thanks in advance. I need to run the numbers below.

``./list_overlap_p.py 683 3431 1808 867``

I ended up using R's phyper function and calculated phyper(682, 1808, 1623, 867, lower.tail=FALSE) that gave me 1.311855e-74

9
Lyco2.3k wrote:

First, for a simple hypergeometric test (a.k.a. Chi-square test) it is not necessary to run a python program, there are good online servers for doing this, e.g. http://www.langsrud.com/fisher.htm (this server performs a 'Fisher's exact test' which is superior to Chi-square for small numbers).

The more crucial question that has to be addressed is the background. If I understand the previous answer(s) correctly, the respondents assume that your 'genome regions' are sampled from a fixed 'countable' pool of genome regions. If this is not the case, this task requires further consideration. For estimating these numbers, you have to think about what is the maximum of 'genome regions' that could have been found (is it determined by the number of features on an array, or by the number of markers that bracket the intervals? Then, you have to think about what degree of 'overlap' do you consider to be 'enough overlap' and what is the probability to find such an overlap in randomly picked 'genomic regions'.

These numbers are straightforward only if there is a defined number of possible 'genomic region' outcomes (and not a continuum) and if 'overlap' in you set means that the genomic regions are identical (and not just partially overlapping). If these conditions are not met, the problem can still be addressed but this depends on the exact nature of your data.

very informative and descriptive answer. Thank you, unfortunately I can accept only one answer and both of you (Brent and Lycopersicon) have helped me in constructing my answer.

0
rishi10 wrote:

For people still interested in this post, this might be helpful: http://bioconductor.org/packages/release/bioc/vignettes/regioneR/inst/doc/regioneR.html