Question: Which statistics should be applied to determine whether a region is differentially methylated or not?
4
dustar1986330 wrote:

Hi,

I've got bisulfite-sequencing data for two differentiation stages. The raw data was mapped using Bismark. For each CpG site, the methylation ratio was marked as "A/B" (A methylated reads vs B unmethylated reads for this site).

Now I want to compare the overall methyaltion of a certain region. Assuming this region contains 3 CpG sites, the methylated/unmethylated ratio  of each site in the two stages are 45/60, 4/10, 5/15 and 14/65, 3/9, 7/20  respectively.

I think there are two ways to calculate the overall methylation rate of the region:

Method 1: total amount of methylated reads / total number of reads within the region

methyaltion rate of stage 1 = (45+4+5)/(45+60+4+10+5+15) = 54/139 = 0.39

methyaltion rate of stage 2 = (14+3+7)/(14+65+3+9+7+20) = 24/118 = 0.20

Method 2: average the methylation rate of each CpG site within the region

methyaltion rate of stage 1 = (45/(45+60) + 4/(4+10) + 5/(5+15))/3 = 0.32

methyaltion rate of stage 2 = (14/(14+65) + 3/(3+9) + 7/(7+20))/3 = 0.22

For method 1, I can then calculate the confidence inteval and the P value to know whether 54 out of 139 is significantly different from 24 out of 118.

However, the methylation rate calculated by method 1 is biased toward the site with higher reads coverage.

Methold 2 seems to be more robust to indicate the atucal methylation rate of the region.

But I don't know which statistics test should I use for method 2 to know the significance.

statistics bismark bs-seq • 3.7k views
modified 6.2 years ago • written 6.2 years ago by dustar1986330
5
Devon Ryan95k wrote:

From context I gather that you lack biological replicates, which is unfortunate but still rather common. As you mentioned, "method 1", which is basically summing across CpGs and performing a Fisher's test is less than ideal, since it will miss interesting cases (using your earlier nomenclature, think of 50/0, 0/50 in sample1 and 25/25, 25/25 in sample 2, they're very different but this method would miss that).

Method 2 lends itself to a weighted paired t-test, which you can do in R. This method is, of course, also not ideal, since you give a weight to each CpG, but if coverage is drastically different between the two samples at a CpG then you start running into problems (there are ways around that, but they'd be moderately annoying).

What would likely be a nice method in this case would be to perform a Fisher's test on the individual CpGs between the samples and then spatially correct the p-values (the original paper from Benjamini & Heller is here but you can find an implementation in the BiSeq bioconductor package, which you could also just directly use to make your life easier since it's actually targeted at RRBS or other targeted BS-seq datasets). You could also use the SLIM model for p-value adjustment, such as is used in methylKit. Somewhat similarly, you could also first use a smoothing method (see BSseq as an example) and then do a paired t-test, though I wouldn't recomment that for your use case.

There are some additional possibilities, but I'd recommend just using BiSeq and being done with it.

Thanks for your quick and detailed reply, Devon. This is really irradiative to me. I've read the mannual of BiSeq and methylKit. MethylKit seems focusing more on the comparison between simple sites.

BiSeq could give what I need. However, what I have is the whole genome BS-seq data instead of RRBS. I think it's still OK as I've got the CpG methylation counts from Bismark, though the depth might not as deep as RRBS. Am I right?

1

Yeah, BiSeq will work for WGBS as well. You probably want to filter out sites without at least 4x coverage. Your other option for WGBS would be BSseq (mentioned above) and just doing the paired T-test in regions of interest (I'm not a bit fan of the smoothing methods myself, but that's another story), should you have some.