Enrichment analysis using a random distribution for mean and standard deviation
8.5 years ago

I am confused about an approach to evaluate enrichment and was wondering if you could help me understand if what I am doing makes sense.

I have counts of regions in my dataset that overlap histone marks, and I would like to know whether histone marks are enriched in my dataset compared to a random set, using a randomization approach.

I have created 1000 similar datasets and counted the number of regions overlapping histone marks in each of these null datasets. In some cases this distribution is normal, but in others it is not.
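
For context, here is a minimal sketch of the randomization scheme described above, using toy intervals and a simple stand-in shuffle step (in practice the null regions would be matched on length, chromosome, GC content, etc., and overlaps computed with a proper interval tool):

# Toy sketch of the permutation scheme (not the real pipeline).
# "regions" are the observed intervals, "marks" the histone-mark intervals;
# shuffle_regions() stands in for whatever matched randomization is actually used.
set.seed(42)
chrom_len <- 1e6
marks   <- data.frame(start = sort(sample(chrom_len, 200)), width = 500)
regions <- data.frame(start = sort(sample(chrom_len, 100)), width = 1000)

count_overlaps <- function(a, b) {
  sum(sapply(a$start, function(s)
    any(s < b$start + b$width & s + a$width[1] > b$start)))
}

shuffle_regions <- function(r) { r$start <- sample(chrom_len, nrow(r)); r }

observed <- count_overlaps(regions, marks)   # count from the real data
sim.null <- replicate(1000, count_overlaps(shuffle_regions(regions), marks))
# (dividing by nrow(regions) turns counts into the proportions used below)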

In cases where the distribution of the proportions from the null datasets is normal:

I can use these data to find the mean, standard deviation, and degrees of freedom and compare this distribution (mean, sd and df) to my observed count. Is this correct?

# "sim.null" is a normal distribution of counts of overlaps that I get from 1000 simulations (this is matched on my original dataset for some features)

sim.null= rnorm(sd=0.001, mean=0.01, n=1000)

# I would like to compare it with the counts I get from my dataset

observed = 0.0125

t = mean(sim.null)-observed / (sd(sim.null)/sqrt(1000))

2*pt(-abs(t),df=999)

# Is this the same as doing this?

t.test(sim.null, mu=observed, alternative="two.sided")$p.value

In cases where the null distribution is not normal: would a Fisher exact test perhaps be appropriate instead?

Thank you very much, any suggestions are very appreciated!

sampling R distribution statistics • 2.2k views
ADD COMMENT

Actually shouldn't the empirical p-value be:

length(sim.null>=0.0125)/1000

And not

sum(sim.null>=0.0125)/1000

since we want the number of occurrences as or more extreme than the observed value?

ADD REPLY

My version counts the occurrences, yours doesn't :)
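
A quick illustration of the difference (length() of a logical vector is just its length, regardless of how many entries are TRUE):

length(sim.null >= 0.0125)   # always 1000: the length of the logical vector
sum(sim.null >= 0.0125)      # the number of TRUEs, i.e. replicates at least as extreme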

ADD REPLY

Yes sorry! I meant to write length(sim.null[sim.null>=observed]) which gives the same answer and yours is better!

Thanks so much for your help!!

ADD REPLY

From the answers I understood that, under a normal distribution, these should all give a similar result. Is this correct?

t= (mean(sim.null)-0.0125)/sd(sim.null)
# P-value assuming a normal distribution
pvalue1= 2*pt(-abs(t),df=999)
# This should be the same as doing this?
pvalue2=t.test(sim.null, mu=0.0125, alternative="two.sided")$p.value
# P-value without distribution assumptions 
pvalue3= sum( abs( sim.null )>=0.0125)/1000

I don't think the t.test conversion is correct, because it gives me very different values compared to the p-value computed from the normal distribution. I don't know why, but ignoring the t.test for now, I hope I can compute the 95% CI directly with this:

c(mean(sim.null) - (1.96)*(sd(sim.null)), mean(sim.null) + (1.96)*(sd(sim.null)))
[1] 0.008038979 0.011916992

This does not include the observed value 0.0125, so the result should be significant. I hope this is correct...

Thank you so much for helping me understand, this is extremely helpful and hard to find in textbooks (well, for someone who, as a non-statistician, doesn't know what to look for)...

ADD REPLY
8.5 years ago

What you meant to write was:

Tstat = (mean(sim.null)-0.0125)/(sd(sim.null)/sqrt(1000))

The p-value will then essentially match (the t.test will tell you <2e-16, which is the smallest value it'll output).
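
A quick check on the toy sim.null from the question (a sketch; the exact values depend on the random draw, and both p-values underflow to zero here):

Tstat = (mean(sim.null)-0.0125)/(sd(sim.null)/sqrt(1000))
2*pt(-abs(Tstat), df=999)                                      # manual one-sample t-test
t.test(sim.null, mu=0.0125, alternative="two.sided")$p.value   # same test via t.test()
# both are effectively 0 (the printed t.test output reports "< 2.2e-16")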

For non-normal distributions, either use the appropriate distribution, or just use an empirical distribution (the latter is required for messy distributions):

sum(sim.null>=0.0125)/1000

Using theoretical distributions (e.g., the normal distribution) will give much nicer p-values if they're appropriate for your dataset.

ADD COMMENT

Thank you for your help Devon!

Ah sorry I forgot the brackets!

I also need the CI, so I guess it's better to go with the t.test. When I then try to get the p-values:

t.test(sim.null, mu=observed, alternative="two.sided")$conf.int
t.test(sim.null, mu=observed, alternative="two.sided")$p.value

I get zeroes, since it looks like the p-value is < 2.2e-16, so this difference is highly significant. Probably not, but is there any way I could get a precise p-value, as we are used to in genomic analyses?

About the non-normal distribution, thank you for explaining this! Using your formula I can get p-values directly from the empirical distribution; could I get a CI for this as well? Thank you!!

ADD REPLY

Well, the "exact" value is 0, assuming one includes floating point arithmetic issues when calculating things.

For confidence intervals in this situation, have a look here for some discussion (N.B., there might well be an R package for this, I've never looked). I suspect there are related posts on the same site. I've not thought about this enough to provide further insight.
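
If it helps, one simple option (an assumption on my part, not necessarily what the linked discussion recommends) is the non-parametric analogue of the mean ± 1.96·sd interval computed earlier: take percentiles of the null replicates directly:

quantile(sim.null, probs=c(0.025, 0.975))   # 2.5th and 97.5th percentiles of the null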

ADD REPLY

Thanks again! One last question... In my real data, if I apply both the t.test and the empirical test, sometimes I get discordant results, i.e.

empirical_p= 0.006
t.test_p= 0.707

Which one would we trust more?

Thanks!!

ADD REPLY

If your background distribution is truly normal, then the t-test would be more reliable. Empirical distributions should only be used when you have no other choice.

ADD REPLY
8.5 years ago
Lemire ▴ 940
This:

t = (mean(sim.null)-0.0125)/(sd(sim.null)/sqrt(1000))

can be made as large as you want simply by creating more null datasets (except for sqrt(1000), all the other terms will remain more or less the same). That's because your denominator estimates how mean(sim.null) is expected to vary from one set of 1000 null replicates to another set of 1000 null replicates. It does not measure how the counts vary from one replicate to another (which is what you want). Drop the sqrt(1000).

Or, much simpler, since you then don't need to verify the underlying assumptions, stick with the empirical p-value in all cases. But do it two-sided:

sum( abs( sim.null )>=0.0125)/1000
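
A quick way to see this with the toy null from the question (a sketch; exact numbers depend on the random seed):

set.seed(1)
for (N in c(1000L, 10000L, 100000L)) {
  sims <- rnorm(n=N, mean=0.01, sd=0.001)
  t.se <- (mean(sims)-0.0125)/(sd(sims)/sqrt(N))   # grows roughly with sqrt(N)
  z.sd <- (mean(sims)-0.0125)/sd(sims)             # stays roughly constant (~ -2.5)
  cat(N, "replicates:  with sqrt(N) =", round(t.se, 1), "  without =", round(z.sd, 2), "\n")
}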
ADD COMMENT
0
Entering edit mode

Hi Lemire,

Thank you! I am still a bit confused, though, about how to run the same comparison using a t-test. I'll summarise what has been said in the answers...

ADD REPLY
