Question: Enrichment analysis using a random distribution for mean and standard deviation
jamespoweraid20 (United States) wrote, 3.5 years ago:

I am confused about an approach for evaluating enrichment and was wondering if you could help me check whether what I am doing makes sense.

I have counts of regions that overlap histone marks in a dataset, and I would like to know whether histone marks are enriched in my dataset compared to a random set, using a randomization approach.

I have created 1000 similar datasets and counted the number of regions overlapping histone marks in these null datasets. In some cases this distribution is normal, but in others it is not.

In cases where the distribution of the proportions from the null datasets is normal:

I can use these data to find the mean, standard deviation, and degrees of freedom, and compare this distribution (mean, sd, and df) to my observed count. Is this correct?

# "sim.null" is a normal distribution of counts of overlaps that I get from 1000 simulations (matched to my original dataset on some features)

 sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)

# I would like to compare it with the counts I get from my dataset

observed = 0.0125
  t = mean(sim.null)-observed / (sd(sim.null)/sqrt(1000))
  2*pt(-abs(t),df=999)
# Is this the same as doing this?

  t.test(sim.null, mu=observed, alternative="two.sided")$p.value

 

In cases where the null distribution is not normal: would a Fisher's exact test perhaps be appropriate?

Thank you very much, any suggestions are very appreciated!

Devon Ryan89k (Freiburg, Germany) wrote, 3.5 years ago:

What you meant to write was:

Tstat = (mean(sim.null)-0.0125)/(sd(sim.null)/sqrt(1000))

The p-value will then essentially match (the t.test will tell you <2e-16, which is the smallest value it'll output).

For non-normal distributions, either use the appropriate distribution, or just use an empirical distribution (the latter is required for messy distributions):

sum(sim.null>=0.0125)/1000

Using theoretical distributions (e.g., the normal distribution) will give much nicer p-values if they're appropriate for your dataset.
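As a quick sanity check (a minimal sketch using the illustrative numbers from the question), the hand-computed statistic and t.test agree:

```r
set.seed(1)  # reproducibility; the numbers are illustrative
sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)
observed <- 0.0125

# Hand-computed one-sample t statistic against mu = observed
Tstat <- (mean(sim.null) - observed) / (sd(sim.null) / sqrt(1000))
p.hand <- 2 * pt(-abs(Tstat), df = 999)

# t.test does exactly the same computation internally
p.builtin <- t.test(sim.null, mu = observed, alternative = "two.sided")$p.value
all.equal(p.hand, p.builtin)  # TRUE (both are essentially 0 here)
```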

 


Thank you for your help, Devon!

Ah, sorry, I forgot the brackets!

I also need the CI, so I guess it is better to go with the t.test. When I then try to get the CI and p-value:

t.test(sim.null, mu=observed, alternative="two.sided")$conf.int
t.test(sim.null, mu=observed, alternative="two.sided")$p.value

I get zeroes, since it looks like the p-value is < 2.2e-16, so the difference is highly significant. Probably not, but is there any way I could get a precise p-value, as we are used to in genomic analyses?

About the non-normal distribution, thank you for explaining this! Using your formula I can get p-values directly from the empirical distribution; could I get a CI for this as well? Thank you!!

 

 

written 3.5 years ago by jamespoweraid20

Well, the "exact" value is 0, assuming one includes floating point arithmetic issues when calculating things.

For confidence intervals in this situation, have a look here for some discussion (N.B., there might well be an R package for this, I've never looked). I suspect there are related posts on the same site. I've not thought about this enough to provide further insight.
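One base-R option (a minimal sketch, not necessarily what the linked discussion recommends): since the empirical p-value is just a proportion out of 1000 simulations, binom.test() gives an exact (Clopper-Pearson) confidence interval for it. The counts below are made up for illustration:

```r
n.sims <- 1000
n.extreme <- 6  # hypothetical: 6 of 1000 null counts were >= the observed value

# Exact (Clopper-Pearson) 95% CI for the empirical p-value 6/1000
ci <- binom.test(n.extreme, n.sims)$conf.int
ci  # roughly 0.0022 to 0.0131
```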

written 3.5 years ago by Devon Ryan89k

Thanks again! One last question... In my real data, if I apply both the t.test and the empirical test, I sometimes get discordant results, e.g.

empirical_p= 0.006

t.test_p= 0.707

Which one would we trust more?

Thanks!!

written 3.5 years ago by jamespoweraid20

If your background distribution is truly normal, then the t-test would be more reliable. Empirical distributions should only be used when you have no other choice.

written 3.5 years ago by Devon Ryan89k
Lemire380 (Canada) wrote, 3.5 years ago:
This:

t <- (mean(sim.null) - 0.0125) / (sd(sim.null) / sqrt(1000))

can be made as large as you want simply by creating more null datasets (except for sqrt(1000), all the other terms will remain more or less the same). That's because your denominator estimates how mean(sim.null) is expected to vary from one set of 1000 null replicates to another set of 1000 null replicates. It does not measure how the counts vary from _one_ replicate to another (which is what you want). Drop the sqrt(1000).

Or, much simpler, since you don't need to verify the underlying assumptions, stick with the empirical p in all cases. But do it two-sided:

sum(abs(sim.null) >= 0.0125) / 1000
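A short sketch of the point above (illustrative numbers): the statistic with sqrt(1000) in the denominator is exactly sqrt(1000) times the one without it, so it inflates as more null datasets are added, while the per-replicate version stays stable:

```r
set.seed(1)
sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)
observed <- 0.0125

# Compares the observed value to the spread of single null replicates
z.single <- (observed - mean(sim.null)) / sd(sim.null)

# Compares it to the spread of the *mean* of 1000 replicates instead
t.mean <- (observed - mean(sim.null)) / (sd(sim.null) / sqrt(1000))

t.mean / z.single  # exactly sqrt(1000), about 31.6
```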

Hi Lemire, 

Thank you! I am now a bit confused, though, about how to run the same test using a t-test. I'll summarise what has been said in the answers...

written 3.5 years ago by jamespoweraid20
jamespoweraid20 wrote:

Actually shouldn't the empirical p-value be:

length(sim.null>=0.0125)/1000

And not

sum(sim.null>=0.0125)/1000

since we want the number of occurrences as extreme as or more extreme than the observed value?


My version counts the occurrences, yours doesn't :)
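A toy example of the difference: sum() on a logical vector counts the TRUEs, while length() just returns the size of the vector, whatever the comparison says:

```r
x <- c(0.1, 0.5, 0.9, 0.2)

sum(x >= 0.5)        # 2 -- counts the elements passing the test
length(x >= 0.5)     # 4 -- always the length of x
length(x[x >= 0.5])  # 2 -- subsetting first also works
```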
 

written 3.5 years ago by Devon Ryan89k

Yes, sorry! I meant to write length(sim.null[sim.null >= observed]), which gives the same answer, and yours is better!

Thanks so much for your help!!

written 3.5 years ago by jamespoweraid20
jamespoweraid20 wrote:
From the answers I understood that, under a normal distribution, these should all give a similar result; is this correct?

  t <- (mean(sim.null) - 0.0125) / sd(sim.null)
  # P-value assuming a normal distribution
  pvalue1 <- 2 * pt(-abs(t), df = 999)
  # This should be the same as doing this?
  pvalue2 <- t.test(sim.null, mu = 0.0125, alternative = "two.sided")$p.value
  # P-value without distribution assumptions
  pvalue3 <- sum(abs(sim.null) >= 0.0125) / 1000

 

I don't think the t.test conversion is correct, because it gives me very different values compared to the p-value computed from the normal distribution. I don't know why, but ignoring the t.test, I hope I can compute the 95% CI directly with this:

c(mean(sim.null) - 1.96 * sd(sim.null), mean(sim.null) + 1.96 * sd(sim.null))

[1] 0.008038979 0.011916992

This does not include the observed value 0.0125, so it should be significant. I hope this is correct...
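For what it's worth, the reason pvalue1 and pvalue2 above differ is mechanical: t.test always divides by the standard error of the mean, sd/sqrt(n), so its statistic is sqrt(1000) times larger in magnitude than one computed with sd alone. A minimal sketch with illustrative numbers:

```r
set.seed(1)
sim.null <- rnorm(n = 1000, mean = 0.01, sd = 0.001)

# Statistic scaled by the per-replicate spread (no sqrt term)
t.single <- (mean(sim.null) - 0.0125) / sd(sim.null)

# t.test uses sd(sim.null)/sqrt(1000) as the denominator instead
t.builtin <- as.numeric(t.test(sim.null, mu = 0.0125)$statistic)

t.builtin / t.single  # exactly sqrt(1000), about 31.6
```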

Thank you so much for helping me understand; this is extremely helpful and hard to find in textbooks (well, for a non-statistician who doesn't know what to look for)...

Powered by Biostar version 2.3.0