Question

Statistical test for DMR annotation?

0

9.3 years ago

Amit Lavon ▴ 10

Hello friends,

I see that this was discussed here: A: Dmr (Differentially Methylated Regions) Identification Software, but I would like to dig a little deeper, because I haven't found a satisfying answer yet.

So, what statistical test would you choose for DMR (differentially methylated region) annotation? That is, you have a 2×2 table with column labels `WT` and `mutant` and row labels `methylated` and `not methylated`, where each cell holds a count for a single region, and you need to test whether methylation depends on the mutation.

I see that `methylKit` uses Fisher's Exact Test, but that test doesn't make sense to me. Why would DMRs behave hypergeometrically? This assumes that the background set from which you sample is finite, right? That's not the case with methylation: you can (theoretically) sample as much as you want, like coin flipping.
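For concreteness, here is a minimal sketch of the test in question, written with only the Python standard library. It is not methylKit's code, and the counts are made up for illustration. It fixes the table margins and sums the hypergeometric probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: methylated / not methylated. Columns: WT / mutant.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def pmf(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    # Sum every table at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Hypothetical read counts for one region (assumed values, not real data).
meth_wt, meth_mut = 30, 12
unmeth_wt, unmeth_mut = 10, 28

p_value = fisher_exact_two_sided(meth_wt, meth_mut, unmeth_wt, unmeth_mut)
print(f"two-sided Fisher p-value: {p_value:.3g}")
```

The `pmf` helper is where the finite-population (sampling without replacement) assumption enters: both margins of the table are treated as fixed.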

Am I right? What test would you use?

Thanks a lot, Amit

statistics methylation DMR • 4.7k views

ADD COMMENT • link updated 6.7 years ago by jordi • 0 • written 9.3 years ago by Amit Lavon ▴ 10

1

If the no-replacement aspect of Fisher's test is what you don't like, then just do a binomial test instead. That said, the two approach each other with increasing N. In any case, Charles' answer makes much more sense than a Fisher's or binomial test.
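One way to see the convergence is to compare the hypergeometric probabilities (sampling without replacement) with the binomial ones (sampling with replacement) as the population grows. This is only a sketch: the sample size `n`, the methylated fraction `p`, and the population sizes are arbitrary assumptions.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k methylated in n draws without replacement; K of N are methylated)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """P(k methylated in n independent draws with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.4  # assumed sample size and methylated fraction
max_diffs = []
for N in (50, 500, 5000):  # assumed population sizes
    K = round(p * N)
    diff = max(
        abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p))
        for k in range(n + 1)
    )
    max_diffs.append(diff)
    print(f"N={N:>5}: largest pmf difference = {diff:.5f}")
```

The largest gap between the two distributions shrinks as N grows relative to n, which is why the choice between them matters less at high coverage.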

ADD REPLY • link 9.3 years ago by Devon Ryan 104k

0

Thank you Devon.

What do you think is the appropriate test for DMRs with a fixed-size window?

Amit

ADD REPLY • link 9.3 years ago by Amit Lavon ▴ 10

Ram · Answer 1 · 2014-12-25

I think that there are two types of DMR calculations: those with predefined region boundaries and those without predefined region boundaries.

If you have a predefined window (such as predefined regions of interest on the 450k array, targeted BS-Seq, or any sliding-window-based analysis), I think the main trick is the summarization (at least that is my opinion). For example, COHCAP averages the signal across either CpG sites or CpG islands and then applies a simple statistical test, such as an ANOVA, to the continuous signal. It also uses additional filters to try to reflect the fact that the original signal can likely be thought of as a discrete variable, where each CpG site is homozygous methylated, homozygous unmethylated, or heterozygous. methylKit and IMA also fall into this category. So, the short answer is that you may be able to use one of those other tools (or a similar strategy), but I think there are people out there who are satisfied with the methylKit results.
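To make the summarize-then-test idea concrete, here is a rough sketch under stated assumptions. It is not COHCAP's code, and the per-CpG methylation fractions are invented. Each sample's CpG values in one region are averaged, and a one-way ANOVA F statistic is computed on the region-level means. To stay within the standard library, the p-value comes from an exact permutation test rather than from the F distribution that an ordinary ANOVA would use.

```python
from itertools import combinations
from statistics import mean

def region_means(samples):
    """Average per-CpG methylation fractions within one region, per sample."""
    return [mean(sites) for sites in samples]

def f_statistic(group_a, group_b):
    """One-way ANOVA F statistic for two groups of region-level means."""
    values = group_a + group_b
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in (group_a, group_b))
    within = sum((x - mean(g)) ** 2 for g in (group_a, group_b) for x in g)
    return between / (within / (len(values) - 2))  # df_between is 1

def permutation_p_value(group_a, group_b):
    """Exact permutation p-value: share of relabelings with F >= observed F."""
    values = group_a + group_b
    observed = f_statistic(group_a, group_b)
    hits = total = 0
    for idx in combinations(range(len(values)), len(group_a)):
        chosen = set(idx)
        a = [values[i] for i in idx]
        b = [values[i] for i in range(len(values)) if i not in chosen]
        total += 1
        if f_statistic(a, b) >= observed * (1 - 1e-9):
            hits += 1
    return hits / total

# Hypothetical methylation fractions: 3 samples per group, 4 CpGs per region.
wt = [[0.82, 0.78, 0.85, 0.80], [0.75, 0.79, 0.81, 0.77], [0.88, 0.84, 0.80, 0.86]]
mutant = [[0.42, 0.38, 0.45, 0.40], [0.35, 0.41, 0.37, 0.39], [0.48, 0.44, 0.46, 0.50]]

wt_means, mut_means = region_means(wt), region_means(mutant)
f_obs = f_statistic(wt_means, mut_means)
p_value = permutation_p_value(wt_means, mut_means)
print(f"F = {f_obs:.1f}, permutation p-value = {p_value:.2f}")
```

With three samples per group there are only 20 possible relabelings, so the permutation p-value cannot drop below 0.1 however large the difference is. With so few replicates, a parametric test on the summarized signal is the more practical choice.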

DMR tools without predefined boundaries (such as bumphunter in the minfi package or ChAMP) are a totally different beast. I don't think a Fisher's Exact Test is appropriate in this situation.

If it helps, there are some script templates and limited benchmarks for a few such programs:

http://www.nature.com/protocolexchange/protocols/2965#/introduction

http://sourceforge.net/projects/cohcap/files/Protocol_Exchange_Example.zip/download

However, the original question was specifically for WGBS data (whereas the links above are for 450k data). Here, methylKit and bsseq are the main options that I know about. MethylSig is another option that I have heard about but not yet tried:

http://sartorlab.ccmb.med.umich.edu/node/17

score 0 · Answer 2 · 2017-08-01

Look at the math in informME:

Jenkinson, G., Pujadas, E., Goutsias, J., & Feinberg, A. P. (2017). Potential energy landscapes identify the information-theoretic nature of the epigenome. Nat Genet, 49(5), 719–729. Retrieved from http://dx.doi.org/10.1038/ng.3811

None of the other tools account for correlation; the closest they get is some sort of smoothing technique. By assuming independence, they cannot control the false positive rate. In addition, differences in methylation do not necessarily have to be differences in mean. The probability distributions of the methylation state for a given region (a binary vector of a certain length) could have the same mean but completely different shapes (a bimodal and a unimodal distribution can have the same mean).
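A toy illustration of that last point, with made-up numbers. It is not informME's model, only two hand-written distributions over the number of methylated CpGs (out of 4) per read in a region. Both have the same mean, but their variances differ, and a generic distance between distributions (the Jensen-Shannon divergence is used here as an assumption) is clearly non-zero.

```python
from math import log2

def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: prob}."""
    m = sum(k * p for k, p in dist.items())
    v = sum(p * (k - m) ** 2 for k, p in dist.items())
    return m, v

def kl_bits(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms are skipped."""
    return sum(pk * log2(pk / q[k]) for k, pk in p.items() if pk > 0)

def jensen_shannon_bits(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_bits(p, mix) + 0.5 * kl_bits(q, mix)

# Hypothetical distributions of methylated CpGs per read (4 CpGs per region).
unimodal = {0: 0.05, 1: 0.20, 2: 0.50, 3: 0.20, 4: 0.05}
bimodal = {0: 0.45, 1: 0.05, 2: 0.00, 3: 0.05, 4: 0.45}

mean_uni, var_uni = mean_and_variance(unimodal)
mean_bi, var_bi = mean_and_variance(bimodal)
jsd = jensen_shannon_bits(unimodal, bimodal)

print(f"unimodal: mean={mean_uni:.2f}, variance={var_uni:.2f}")
print(f"bimodal:  mean={mean_bi:.2f}, variance={var_bi:.2f}")
print(f"Jensen-Shannon divergence: {jsd:.3f} bits")
```

A test that only compares mean methylation would see no difference between these two regions.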