Enrichment determination from random sampling
9.6 years ago

Hi all,

I would like to know if my approach makes sense or not.

I have a count of histone mark peaks (from ENCODE bigWig files) in an 80 kb window of hg19.

I have sampled 1000 random positions (each 80 kb in length) from hg19 and built an empirical null distribution from the peak counts at those 1000 positions. All looks good, and the distribution looks approximately normal.

Now, to check whether my original 80 kb window is significantly enriched, I can simply check whether my observed count falls inside or outside the central 95% of the null distribution, using the mean and standard deviation of the null distribution (the 68-95-99.7 rule). I would also like to get an approximate p-value, and this is where I am slightly unsure. I think I can do one of the following, depending on whether I want a one-tailed or a two-tailed test:

One-tailed test: Count the observations in the null distribution that are greater than or equal to my observed count and divide by 1000.

Two-tailed test: Take the absolute difference between my observed count and the mean of the null distribution, then count the null observations whose absolute difference from the null mean is greater than or equal to that value, and divide by 1000.

Confirmation or suggestions will be greatly appreciated!

Distribution sampling R Statistics • 2.6k views
9.6 years ago

Yes, that's how it should be done. In both cases you're testing for the fraction more extreme than what you observed. In one case, you only care about "more extreme in one direction" (a 1-tailed test) while in the other you care about "more extreme in either direction" (a 2-tailed test).
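For illustration, the two procedures from the question can be sketched in a few lines of Python. This is only a minimal sketch: the function name `empirical_p_values`, the simulated `null_counts`, and the `observed_count` of 75 are assumptions made up for the example, standing in for the real peak counts from the 1000 sampled windows.

```python
import random
import statistics


def empirical_p_values(null_counts, observed):
    """Return (one_tailed, two_tailed) empirical p-values for `observed`."""
    n = len(null_counts)
    null_mean = statistics.mean(null_counts)

    # One-tailed: fraction of null counts at least as large as the observation.
    one_tailed = sum(c >= observed for c in null_counts) / n

    # Two-tailed: fraction of null counts lying at least as far from the
    # null mean as the observation does, in either direction.
    observed_dev = abs(observed - null_mean)
    two_tailed = sum(abs(c - null_mean) >= observed_dev for c in null_counts) / n

    return one_tailed, two_tailed


# Hypothetical stand-in for the peak counts of the 1000 sampled 80 kb windows.
rng = random.Random(0)
null_counts = [round(rng.gauss(50, 10)) for _ in range(1000)]

observed_count = 75  # hypothetical peak count in the window of interest
one_tailed, two_tailed = empirical_p_values(null_counts, observed_count)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

With 1000 sampled windows the smallest nonzero p-value this can return is 0.001, so a result of 0 is better reported as p < 0.001 than as p = 0.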


Great, thanks for the confirmation!


I have a similar issue and would like to go with the two-tailed test, but shouldn't the second procedure also divide by the standard deviation before testing?

t = [ mean.null - my.observation.count ] / [ sd.null / sqrt( 1000 ) ]


No, we're not computing something for a t-test. We're computing the p-value directly from an empirical background distribution. A t-statistic incorporates the standard deviation in order to approximate a distribution that, in this case, has already been observed empirically.
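A small numerical sketch may make the difference concrete. It assumes a simulated null distribution (mean 50, standard deviation 10) and a hypothetical observation of 60, roughly one standard deviation above the null mean; none of these numbers come from the thread. Dividing by `sd / sqrt(1000)` uses the standard error of the null mean, which describes how precisely the mean is known rather than how much individual windows vary, so the resulting statistic grows with the number of sampled windows.

```python
import math
import random
import statistics

# Hypothetical null distribution standing in for the 1000 sampled windows.
rng = random.Random(1)
null_counts = [rng.gauss(50, 10) for _ in range(1000)]
observed = 60  # hypothetical count, about one SD above the null mean

n = len(null_counts)
null_mean = statistics.mean(null_counts)
null_sd = statistics.stdev(null_counts)

# Statistic proposed in the comment: scales the difference by the standard
# error of the null mean, so it is inflated by a factor of sqrt(n).
t_like = (null_mean - observed) / (null_sd / math.sqrt(n))

# Plain z-score of the single observation against the null spread.
z = (observed - null_mean) / null_sd

# Two-tailed empirical p-value taken directly from the null distribution.
observed_dev = abs(observed - null_mean)
empirical_p = sum(abs(c - null_mean) >= observed_dev for c in null_counts) / n

print(f"t-like statistic: {t_like:.1f}")
print(f"z-score:          {z:.2f}")
print(f"empirical p:      {empirical_p:.3f}")
```

In this setup the t-like statistic is sqrt(1000) times the magnitude of the z-score, which would suggest an extreme result even though a sizeable fraction of the null windows deviate at least as much as the observation does.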