Question

Check my work? Hypergeometric test for enrichment between iCLIP sites across experiments

1

7.4 years ago

benformatics 3.9k

Hi, I was hoping for some more feedback on a test I am trying to perform to check for enrichment at specific sites throughout the genome. My statistics background isn't great.

Here is the current setup:

Given: genomic coordinates of iCLIP binding sites (single-nucleotide positions, each corresponding to the site of a crosslink) from two different proteins, Sample A and Sample B.

Goal: the researcher wants to put a p-value on whether more Sample A positions lie near Sample B positions than you would expect to observe by chance. 60-nt bins were chosen for a biological reason related to the protein from Sample B.

Setting up the test:

Step 1: split the genome into 60-nt bins (do both strands separately) and count the total number of bins --> total number of balls in the urn

Step 2: count the number of bins overlapping with one or more Sample A positions --> total number of white balls in the urn

Step 3: count the number of bins overlapping with one or more Sample B positions --> total number of balls drawn without replacement from the urn

Step 4: count the number of bins overlapping with at least one Sample A position and at least one Sample B position --> total number of white balls drawn without replacement from the urn

Does this seem like an acceptable test to do, or is there a better test for this kind of scenario?

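If anyone is interested, this is how the p-value is generated in R for the test: phyper(q = step4 - 1, m = step2, n = step1 - step2, k = step3, lower.tail = FALSE). The 1 is subtracted from q because phyper with lower.tail = FALSE returns P(X > q), whereas the enrichment p-value is P(X >= step4), which includes the observed overlap itself.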
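The four steps and the resulting p-value can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than the poster's actual pipeline: the toy coordinates, the 600-nt chromosome length, and the helper names `occupied_bins` and `hypergeom_upper_tail` are all assumptions made for the example. The tail is computed as P(X >= overlap), so the observed overlap is included.

```python
from math import comb

BIN_SIZE = 60  # bin width in nt, as in the post

# Hypothetical toy data: (chromosome, strand, position) crosslink sites.
chrom_lengths = {"chr1": 600}
sites_a = [("chr1", "+", 10), ("chr1", "+", 75), ("chr1", "+", 130), ("chr1", "-", 10)]
sites_b = [("chr1", "+", 20), ("chr1", "+", 100), ("chr1", "-", 500)]


def occupied_bins(sites, bin_size=BIN_SIZE):
    """Return the set of (chrom, strand, bin index) bins hit by at least one site."""
    return {(chrom, strand, pos // bin_size) for chrom, strand, pos in sites}


def hypergeom_upper_tail(overlap, white, total, drawn):
    """P(X >= overlap) for `drawn` balls taken from `total`, of which `white` are white."""
    upper = min(white, drawn)
    favourable = sum(
        comb(white, x) * comb(total - white, drawn - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / comb(total, drawn)


# Step 1: bins on both strands (ceiling division so a partial last bin counts).
total_bins = 2 * sum(-(-length // BIN_SIZE) for length in chrom_lengths.values())
# Steps 2 and 3: bins occupied by Sample A and by Sample B.
bins_a = occupied_bins(sites_a)
bins_b = occupied_bins(sites_b)
# Step 4: bins occupied by both samples.
overlap = len(bins_a & bins_b)

p_value = hypergeom_upper_tail(overlap, len(bins_a), total_bins, len(bins_b))
print(total_bins, len(bins_a), len(bins_b), overlap, p_value)
```

With real data the exact sum can become slow for very large counts, in which case a library routine for the hypergeometric survival function is the more practical choice.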

hypergeometric enrichment • 2.0k views

updated 7.4 years ago by michael.ante ★ 3.8k • written 7.4 years ago by benformatics 3.9k

0

Also have a look at the GenometriCorr R package.

7.4 years ago by Alastair Kerr 5.3k

score 1 · Answer 1 · 2016-11-17

That seems like a fairly reasonable approach. This link has some other methods too: Association between bed files - statistical significance

Personally, I'd do a quick permutation test just to reassure myself (and to have a visual representation for a collaborator/PI). For example, count the number of overlapping bins in each of a million permutations and see where your true number of overlaps falls in that null distribution.
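A permutation check along these lines might look like the sketch below. It is only an illustration under stated assumptions: the function name `permutation_p_value` is made up, bins are shuffled uniformly over the whole urn, and far fewer than a million permutations are run to keep it quick. Under uniform shuffling the null distribution is the hypergeometric one, so the empirical p-value should land close to the analytical result.

```python
import random


def permutation_p_value(total_bins, n_a, n_b, observed, n_perm=10_000, seed=0):
    """Empirical P(overlap >= observed) when Sample B bins are placed at random.

    Sample A bins are held fixed (which bins they are does not matter under
    uniform shuffling) and only the Sample B bins are re-drawn each round.
    """
    rng = random.Random(seed)
    bins_a = set(range(n_a))
    hits = 0
    for _ in range(n_perm):
        bins_b = rng.sample(range(total_bins), n_b)  # draw without replacement
        overlap = sum(b in bins_a for b in bins_b)
        if overlap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy counts: 20 bins, 4 with Sample A, 3 with Sample B, 2 shared.
p_example = permutation_p_value(total_bins=20, n_a=4, n_b=3, observed=2)
print(p_example)
```

The permutation approach becomes more informative once the shuffling respects real constraints, such as placing sites only within expressed transcripts, because the null then no longer matches the simple urn model.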

score 1 · Answer 2 · 2016-11-18

Hi benformatics,

I also like your approach. I would only reduce the number in step 1, e.g. by using only annotated binding sites or the UTR regions. If there is a strong statistical dependency between the two proteins, you need to model that. In that case you can use a Monte Carlo sampling approach like the one in http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012432 to compute empirical p-values.
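To see why the size of the urn in step 1 matters, the sketch below computes the same upper-tail probability with a genome-wide background and with a smaller, restricted one. All numbers are invented for illustration, and the example assumes that every occupied bin lies inside the restricted regions, so the counts from steps 2 to 4 stay the same.

```python
from math import comb


def hypergeom_upper_tail(overlap, white, total, drawn):
    """P(X >= overlap) for `drawn` bins taken from `total`, `white` of them marked."""
    upper = min(white, drawn)
    favourable = sum(
        comb(white, x) * comb(total - white, drawn - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / comb(total, drawn)


# Hypothetical counts: bins with Sample A, bins with Sample B, bins with both.
bins_a, bins_b, overlap = 200, 150, 15

genome_wide_bins = 100_000  # assumed size of the urn when every bin is counted
restricted_bins = 5_000     # assumed size when only e.g. UTR bins are counted

p_genome = hypergeom_upper_tail(overlap, bins_a, genome_wide_bins, bins_b)
p_restricted = hypergeom_upper_tail(overlap, bins_a, restricted_bins, bins_b)
print(p_genome, p_restricted)
```

The genome-wide background yields a much smaller p-value for the same overlap, because most of the extra bins could never have contained a site from either protein. Restricting the urn to plausible regions gives a more conservative test.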

Cheers, Michael