Question: Enrichment analysis using a random distribution for mean and standard deviation
jamespoweraid wrote:

I am confused about an approach to evaluate enrichment and was wondering if you could help me understand if what I am doing makes sense.

I have counts from regions that overlap histone markers in a dataset and I would like to know if histone markers are enriched in my dataset compared to a random set using a randomization approach.

I have created 1000 similar datasets and counted the number of regions overlapping histones in these null datasets. In some cases this distribution is normal, but in others it is not.

In cases where the distribution of the proportions from the null datasets is normal:

I can use these data to find the mean, standard deviation, and degrees of freedom and compare this distribution (mean, sd and df) to my observed count. Is this correct?

# "sim.null" is a normal distribution of counts of overlaps that I get from 1000 simulations (this is matched on my original dataset for some features)

` sim.null= rnorm(sd=0.001, mean=0.01, n=1000)`

# I would like to compare it with the counts I get from my dataset

`observed = 0.0125`
```  t = mean(sim.null)-observed / (sd(sim.null)/sqrt(1000))
```
`  2*pt(-abs(t),df=999)`
```# Is this the same as doing this?

t.test(sim.null, mu=observed, alternative="two.sided")\$p.value```

In cases where the null distribution is not normal: would a Fisher exact test maybe be appropriate?

Thank you very much, any suggestions are very appreciated!

modified 4.3 years ago • written 4.3 years ago by jamespoweraid

Actually shouldn't the empirical p-value be:

```
length(sim.null>=0.0125)/1000
```

And not

```
sum(sim.null>=0.0125)/1000
```

since we want the number of occurrences as or more extreme than the observed?


My version counts the occurrences, yours doesn't :)

Yes sorry! I meant to write `length(sim.null[sim.null>=observed])` which gives the same answer and yours is better!

Thanks so much for your help!!

From the answers I understood that, under a normal distribution, these should all give a similar result. Is this correct?

```
t = (mean(sim.null)-0.0125)/sd(sim.null)
# P-value assuming a normal distribution
pvalue1 = 2*pt(-abs(t), df=999)
# This should be the same as doing this?
pvalue2 = t.test(sim.null, mu=0.0125, alternative="two.sided")$p.value
# P-value without distribution assumptions
pvalue3 = sum(abs(sim.null)>=0.0125)/1000
```

I don't think the t.test conversion is correct, because it gives me very different values compared to the p-value computed from the normal distribution. I don't know why, but ignoring the t.test for now, I hope I can compute a 95% CI directly with this:

```
c(mean(sim.null) - 1.96*sd(sim.null), mean(sim.null) + 1.96*sd(sim.null))
# 0.008038979 0.011916992
```

Which does not include the observed value 0.0125, so it should be significant. I hope this is correct.
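For illustration, the interval check above can be sketched in Python (stdlib only; the thread's own code is R, and `sim_null` here is a simulated stand-in, not the poster's data). Note that mean ± 1.96*sd describes where roughly 95% of individual null replicates fall, which is the relevant comparison when judging a single observed value.

```python
import random
import statistics

random.seed(1)
# Stand-in for sim.null: 1000 simulated null proportions (illustrative values)
sim_null = [random.gauss(0.01, 0.001) for _ in range(1000)]
observed = 0.0125

m = statistics.mean(sim_null)
s = statistics.stdev(sim_null)
# mean +/- 1.96*sd: the range covering about 95% of individual null replicates
lower, upper = m - 1.96 * s, m + 1.96 * s

# observed = 0.0125 sits about 2.5 sd above the null mean, so it falls
# outside this interval, matching the conclusion in the thread
print(lower, upper, lower <= observed <= upper)
```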

Thank you so much for helping me understand; this is extremely helpful and hard to find in textbooks (at least for a non-statistician who doesn't know what to look for)...

Devon Ryan wrote:

What you meant to write was:

```
Tstat = (mean(sim.null)-0.0125)/(sd(sim.null)/sqrt(1000))
```

The p-value will then essentially match (the t.test will tell you <2e-16, which is the smallest value it'll output).

For non-normal distributions, either use the appropriate distribution, or just use an empirical distribution (the latter is required for messy distributions):

```
sum(sim.null>=0.0125)/1000
```

Using theoretical distributions (e.g., the normal distribution) will give much nicer p-values if they're appropriate for your dataset.
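For illustration, the empirical p-value described above can be sketched in Python (stdlib only; `sim_null` is a simulated stand-in for the null overlap counts, not real data):

```python
import random

random.seed(42)
# Stand-in for sim.null: 1000 simulated null overlap proportions (illustrative)
sim_null = [random.gauss(0.01, 0.001) for _ in range(1000)]
observed = 0.0125

# Fraction of null replicates at least as extreme as the observed value,
# i.e. the R expression sum(sim.null >= 0.0125)/1000
p_upper = sum(x >= observed for x in sim_null) / len(sim_null)

# With mean 0.01 and sd 0.001, the observed value is 2.5 sd out, so this
# fraction will be small (on the order of 0.006 under the normal model)
print(p_upper)
```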

Thank you for your help Devon!

Ah sorry I forgot the brackets!

I also need the CI, so I guess it's better to go with the t.test. When I then try to get the p-values:

```
t.test(sim.null, mu=observed, alternative="two.sided")$conf.int
t.test(sim.null, mu=observed, alternative="two.sided")$p.value
```

I get zeroes, since it looks like the p-value is < 2.2e-16, so this difference is highly significant. Probably not, but is there any way I could get a precise p-value, as we are used to in genomic analyses?

About the non-normal distribution, thank you for explaining this! Using your formula I can get p-values directly from the empirical distribution, could I get a CI for this as well? Thank you!!

Well, the "exact" value is 0, assuming one includes floating point arithmetic issues when calculating things.

For confidence intervals in this situation, have a look here for some discussion (N.B., there might well be an R package for this, I've never looked). I suspect there are related posts on the same site. I've not thought about this enough to provide further insight.

Thanks again! One last question... In my real data, if I apply both the t.test and the empirical test, sometimes I get discordant results, i.e.

```
empirical_p = 0.006
t.test_p = 0.707
```

Which one would we trust more?

Thanks!!


If your background distribution is truly normal, then the t-test would be more reliable. Empirical distributions should only be used when you have no other choice.

Lemire wrote:
This: `t = (mean(sim.null)-0.0125)/(sd(sim.null)/sqrt(1000))` can be made as large as you want simply by creating more null datasets (except for sqrt(1000), all other terms will remain more or less the same). That's because your denominator estimates how mean(sim.null) is expected to vary from one set of 1000 null replicates to another set of 1000 null replicates. It does not measure how sim.null varies from _one_ replicate to another (which is what you want). Drop the sqrt(1000). Or, much simpler since you don't need to verify the underlying assumptions, stick with the empirical p in all cases. But do it two-sided though: `sum(abs(sim.null)>=0.0125)/1000`.
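Lemire's point about the sqrt(1000) can be illustrated with a short Python sketch (stdlib only; all numbers are made up): the z-style statistic (observed - mean)/sd stays roughly constant as the number of null replicates grows, while the t-style statistic inflates like sqrt(n).

```python
import math
import random
import statistics

random.seed(7)
observed = 0.0125

results = {}
for n in (1000, 100000):
    # Fresh simulated null distribution of size n (illustrative values)
    sim_null = [random.gauss(0.01, 0.001) for _ in range(n)]
    m, s = statistics.mean(sim_null), statistics.stdev(sim_null)
    z = (observed - m) / s                   # stable as n grows (about 2.5 here)
    t = (observed - m) / (s / math.sqrt(n))  # inflates like sqrt(n)
    results[n] = (z, t)
    print(n, round(z, 2), round(t, 2))
```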

Hi Lemire,

Thank you! I am a bit confused though now how to run the same using a t-test? I'll summarise what has been said in the answers...