Question: Check my work? Hypergeometric test for enrichment between iCLIP sites across experiments
benformatics (ETH Zurich) wrote, 2.8 years ago:

Hi, I was hoping for some more feedback on a test I am trying to perform to check for enrichment at specific sites throughout the genome. My statistics background isn't great.


Here is the current setup:

Given: genomic coordinates of iCLIP binding sites (single-nucleotide positions, each corresponding to the site of a crosslink) for two different proteins, Sample A and Sample B.

Goal: the researcher wants to put a p-value on whether more Sample A positions lie near Sample B positions than you would expect to observe by chance. 60-nt bins were chosen for a biological reason related to the protein from Sample B.


Setting up the test:

Step 1: split the genome into 60-nt bins (each strand separately) and count the total number of bins --> total number of balls in the urn

Step 2: count the number of bins overlapping one or more Sample A positions --> total number of white balls in the urn

Step 3: count the number of bins overlapping one or more Sample B positions --> total number of balls drawn without replacement from the urn

Step 4: count the number of bins overlapping at least one Sample A position and at least one Sample B position --> total number of white balls drawn without replacement from the urn
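The four counts from the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sites are taken to be (chrom, strand, pos) tuples with 0-based positions, and the chromosome sizes and site lists are made-up toy values, not data from the experiments.

```python
BIN_SIZE = 60  # bin width in nt, as chosen in the post


def occupied_bins(sites, bin_size=BIN_SIZE):
    """Set of (chrom, strand, bin_index) keys hit by at least one site.

    Assumes `sites` yields (chrom, strand, pos) with 0-based positions.
    """
    return {(chrom, strand, pos // bin_size) for chrom, strand, pos in sites}


def total_bins(chrom_sizes, bin_size=BIN_SIZE):
    """Step 1: number of bins over all chromosomes, counting both strands."""
    # -(-a // b) is integer ceiling division, so a partial last bin is counted.
    return 2 * sum(-(-length // bin_size) for length in chrom_sizes.values())


def urn_counts(sites_a, sites_b, chrom_sizes, bin_size=BIN_SIZE):
    """Steps 1-4: the four numbers that feed the hypergeometric test."""
    bins_a = occupied_bins(sites_a, bin_size)
    bins_b = occupied_bins(sites_b, bin_size)
    return {
        "total": total_bins(chrom_sizes, bin_size),  # step 1: balls in the urn
        "white": len(bins_a),                        # step 2: white balls
        "drawn": len(bins_b),                        # step 3: balls drawn
        "white_drawn": len(bins_a & bins_b),         # step 4: white balls drawn
    }


# Toy input (hypothetical values, for illustration only)
chrom_sizes = {"chr1": 600, "chr2": 130}
sites_a = [("chr1", "+", 5), ("chr1", "+", 70), ("chr1", "-", 5), ("chr2", "+", 125)]
sites_b = [("chr1", "+", 59), ("chr1", "+", 300), ("chr2", "+", 120)]
counts = urn_counts(sites_a, sites_b, chrom_sizes)
```

Several sites falling in the same bin count once, which matches the "one or more positions" wording of the steps.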


Does this seem like an acceptable test to do? Or is there a better test for this kind of scenario?

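If anyone is interested, this is how the p-value is generated in R for the test: 1 - phyper(q = step4 - 1, m = step2, n = step1 - step2, k = step3), where stepN is the count from Step N. Note the q = step4 - 1: phyper(q, ...) returns P(X <= q), so the subtraction is needed for the p-value to be P(X >= step4), i.e. to include the observed overlap itself.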
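To make the tail probability concrete, here is a minimal pure-Python sketch of the same calculation. R's phyper(q, m, n, k) returns P[X <= q], so the enrichment p-value P[X >= observed] corresponds to q = observed - 1, or equivalently phyper(observed - 1, m, n, k, lower.tail = FALSE). The function name and the toy counts below are assumptions for illustration. Exact binomial coefficients are fine for small numbers, but for genome-scale bin counts a library routine such as phyper itself is the practical choice.

```python
from math import comb


def hypergeom_upper_tail(x, m, n, k):
    """P[X >= x] for a hypergeometric X, using R's phyper naming:
    m = white balls in the urn, n = black balls, k = balls drawn."""
    lo = max(x, 0, k - n)  # smallest count that is both >= x and attainable
    hi = min(m, k)         # largest attainable count
    favourable = sum(comb(m, i) * comb(n, k - i) for i in range(lo, hi + 1))
    return favourable / comb(m + n, k)


# Toy counts (made up): 26 bins, 4 with A, 3 with B, 2 with both.
step1, step2, step3, step4 = 26, 4, 3, 2

# P[X >= 2]: includes the observed overlap, the usual enrichment p-value.
p_at_least = hypergeom_upper_tail(step4, step2, step1 - step2, step3)

# P[X > 2]: what 1 - phyper(q = step4, ...) computes; it excludes the observed value.
p_strictly_more = hypergeom_upper_tail(step4 + 1, step2, step1 - step2, step3)
```

With these toy counts the two quantities differ by more than an order of magnitude, which shows why the choice of q matters when overlaps are small.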

Tags: hypergeometric, enrichment

Comment by Alastair Kerr, 2.7 years ago: Also have a look at the GenometriCorr R package.
fanli.gcb (Los Angeles, CA) wrote, 2.8 years ago:

That seems like a fairly reasonable approach. This thread has some other methods too: "Association between bed files - statistical significance".

Personally, I'd do a quick permutation test just to reassure myself (and to have a visual representation for a collaborator or PI). Something like: count the number of overlapping bins in each of a million permutations and see where your true number of overlaps falls in that null distribution.
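A minimal sketch of such a permutation test, assuming the simplest possible null (the B-occupied bins are placed uniformly at random among all bins). The function names, the seed, and the toy counts are illustrative assumptions. Under this particular shuffle the null distribution is exactly the hypergeometric one, so it mainly serves as a sanity check and a source for a histogram; a permutation scheme that respects extra constraints (for example, shuffling only within expressed transcripts) would need a different sampling step.

```python
import random


def permutation_null(total, white, drawn, n_perm=10_000, seed=0):
    """Overlap counts when the B-occupied bins are placed uniformly at random."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        # Label bins 0..white-1 as the A-occupied ones; only the count matters here.
        picks = rng.sample(range(total), drawn)
        null.append(sum(1 for b in picks if b < white))
    return null


def empirical_p(null, observed):
    """Share of permutations with overlap >= observed, with the +1 correction
    so that the estimate is never exactly zero."""
    at_least = sum(1 for v in null if v >= observed)
    return (at_least + 1) / (len(null) + 1)


# Toy counts (made up): 26 bins, 4 with A, 3 with B, observed overlap of 2.
null = permutation_null(total=26, white=4, drawn=3, n_perm=2000, seed=1)
p_emp = empirical_p(null, observed=2)
```

Plotting a histogram of `null` with a vertical line at the observed overlap gives the visual summary mentioned above.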

michael.ante (Austria/Vienna) wrote, 2.7 years ago:

Hi benformatics,

I also like your approach. The only thing I would change is to reduce the number in step 1, e.g. by counting only bins in annotated binding sites or in the UTR areas. If there is a strong statistical dependency between the two proteins, you need to model that. In that case you can use a Monte Carlo sampling approach, like in http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012432, in order to compute empirical p-values.
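One way to read the suggestion about step 1 is to define the urn as only those bins that overlap a chosen set of regions, and to recount everything inside that universe. The sketch below assumes regions are given as (chrom, strand, start, end) tuples with 0-based, half-open coordinates; the region list and bin sets are hypothetical toy values.

```python
BIN_SIZE = 60  # bin width in nt, as chosen in the question


def bins_in_regions(regions, bin_size=BIN_SIZE):
    """Set of (chrom, strand, bin_index) keys for every bin overlapping a region.

    Assumes regions are (chrom, strand, start, end), 0-based and half-open.
    """
    keep = set()
    for chrom, strand, start, end in regions:
        for idx in range(start // bin_size, (end - 1) // bin_size + 1):
            keep.add((chrom, strand, idx))
    return keep


def restricted_counts(bins_a, bins_b, universe):
    """Steps 1-4 recomputed inside the restricted universe of bins."""
    a = bins_a & universe
    b = bins_b & universe
    return len(universe), len(a), len(b), len(a & b)


# Toy input (hypothetical): occupied bins for each sample and two "UTR" regions.
bins_a = {("chr1", "+", 0), ("chr1", "+", 1), ("chr1", "-", 0), ("chr2", "+", 2)}
bins_b = {("chr1", "+", 0), ("chr1", "+", 5), ("chr2", "+", 2)}
regions = [("chr1", "+", 0, 120), ("chr2", "+", 100, 130)]

universe = bins_in_regions(regions)
total, white, drawn, white_drawn = restricted_counts(bins_a, bins_b, universe)
```

Bins outside the universe are dropped from all four counts, so sites in unannotated regions no longer inflate the background.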

Cheers, Michael
