Question

Interpreting Fractional Methylation Data


10.4 years ago

qliu2011 ▴ 40

Given WGBS data (fractional methylation and read coverage) for all chromosomes, how should I go about binarizing the fractional methylation values? Should I set a hard cutoff (above 0.6 fractional methylation is "methylated", below is "unmethylated"), or should the read coverage be taken into account somehow? Obviously the binarization will not be perfect, but I need it to run my computer simulation. Thanks in advance for the help.

methylation • 8.6k views

ADD COMMENT • link updated 10.4 years ago by B. Arman Aksoy ★ 1.2k • written 10.4 years ago by qliu2011 ▴ 40

score 4 · Answer 1 · 2013-12-03

Here is my take on it:

The short answer is: yes, you can define a hard threshold to binarize your methylation data, and as far as I know, the majority of methylation-related papers do this.

The long answer is: yes, you can define a threshold, but you should do it in a way that helps you explain your phenotype of interest, e.g. gene expression. In this sense, it is also important to know whether you want to work with probe-level or gene-level data.

Let's say you are working with probe-level data. Most of the time, people are interested in the effect of methylation on transcript levels. That requires you to identify which probe is most informative for a given gene, and which cut-off for the β-value (beta) best separates the samples that show down-regulation of that gene from the normal ones. This threshold might be different for each gene (depending on the coverage, promoter sensitivity, CG content of that region, etc.). For example, you sometimes see hyper-methylated promoter regions (β ~ 1) for a gene that does not show any differential regulation at all. In these cases, would it make sense to threshold the methylation data and call these probes/genes methylated? It depends on what you want to accomplish with your binary data.

I think whatever approach you use will be fine, as the field does not have a standard way of doing things; everybody seems to be going their own way nowadays. As long as you are aware of the artifacts you might have in your pipeline, I think the simple binary approach might be the easiest way to go, but it is not necessarily the best in terms of explaining biological mechanisms and phenotypic effects.

Oh and you might find the following TCGA guideline useful: https://confluence.broadinstitute.org/display/GDAC/Methylation+Preprocessor
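As a rough illustration of the simple binary approach discussed in this answer, here is a minimal Python sketch that applies a hard cutoff and leaves low-coverage sites uncalled. The function name `binarize`, the 0.6 cutoff (taken from the question), and the minimum coverage of 10 reads are all assumptions for illustration, not recommended values.

```python
from typing import Optional


def binarize(beta: float, coverage: int,
             threshold: float = 0.6, min_coverage: int = 10) -> Optional[int]:
    """Return 1 (methylated), 0 (unmethylated), or None if coverage is too low.

    threshold and min_coverage are illustrative defaults, not field standards.
    """
    if coverage < min_coverage:
        # Too few reads to trust the fractional methylation estimate.
        return None
    return 1 if beta > threshold else 0


# (fractional methylation, read coverage) for a few hypothetical CpG sites
sites = [(0.92, 35), (0.10, 22), (0.75, 3), (0.60, 18)]
calls = [binarize(beta, cov) for beta, cov in sites]
print(calls)  # [1, 0, None, 0]
```

Leaving low-coverage sites as `None` rather than forcing a call keeps the uncertainty visible, so the downstream simulation can decide how to handle them.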

score 2 · Answer 2 · 2013-12-03


10.4 years ago

Charles Warden 8.2k

If you use Bismark, there are tools for calculating percentage methylation from the alignments. I think there is a minimum cutoff parameter, but I don't think it actually matters, because I'm pretty sure the final result includes read counts for methylated and unmethylated nucleotides.

http://www.bioinformatics.babraham.ac.uk/projects/bismark/Bismark_User_Guide.pdf
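Since the per-site methylated and unmethylated read counts are available, the fractional methylation and total coverage can be recomputed directly. The sketch below assumes a six-column, tab-delimited line (chromosome, start, end, percent methylation, methylated count, unmethylated count), which is how I understand Bismark's coverage output to be laid out; check it against your own files. The function name `parse_cov_line` and the `min_reads` default are hypothetical.

```python
from typing import Optional, Tuple


def parse_cov_line(line: str, min_reads: int = 5) -> Optional[Tuple[str, int, float, int]]:
    """Parse one coverage line into (chrom, position, beta, total_reads).

    Assumed columns: chrom, start, end, percent methylation,
    methylated read count, unmethylated read count (tab-separated).
    Returns None when the site has fewer than min_reads reads.
    """
    chrom, start, _end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
    meth_count, unmeth_count = int(meth), int(unmeth)
    total = meth_count + unmeth_count
    if total < min_reads:
        return None
    # Recompute the fraction from the counts instead of trusting the percentage.
    return chrom, int(start), meth_count / total, total


example = "chr1\t10469\t10469\t80.0\t8\t2\n"
print(parse_cov_line(example))  # ('chr1', 10469, 0.8, 10)
```

Working from the counts means any coverage filter can be applied after the fact, whatever cutoff was used when the file was produced.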

My COHCAP algorithm is not ideal for whole-genome BS-Seq, but it uses methylated and unmethylated thresholds like the ones you described. So perhaps you can take a look at the paper for analysis ideas (I think it is a good choice for targeted BS-Seq analysis):

http://nar.oxfordjournals.org/content/41/11/e117.long

ADD COMMENT • link 10.4 years ago by Charles Warden 8.2k


Hi, your paper states: "The CpG site analysis is based on the method described in Sproul et al. (44), where sites are defined as methylated if they show a percentage of methylation (beta) greater than a certain value (0.7 for cell line data, 0.3 for patient data) and sites are unmethylated if they have beta values <0.3 (by default)...We extended this algorithm to include a P-value and false-discovery rate [FDR, using the method of Benjamini and Hochberg (45)] value as cutoffs for differential expression. The method of P-value calculation varies based on the number of groups considered for the analysis (one group, two groups, three or more groups; Supplementary Table S2, Supplemental Methods)."

It appears that you are using a hard cutoff of 0.7 for considering a site as methylated. But then you state that if a site has a fractional methylation value of <0.3, it is not methylated. What happens to the values in between 0.3 and 0.7? Thanks for the help!

ADD REPLY • link 10.4 years ago by qliu2011 ▴ 40


When working with fairly clean cell line data, there are not a lot of sites with beta values between 0.3 and 0.7. Thus, you could call the intermediate values either ambiguous or heterozygous (one methylated and one unmethylated allele). For the Illumina array data, I occasionally saw an intermediate "heterozygous" peak, but I usually only saw clear peaks > 0.7 and < 0.3. For BS-Seq, the distribution is different (but I think the bimodal peaks I saw were even sharper, making the intermediate methylation values less of an issue).
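To make the three-way split concrete, here is a small Python sketch that labels each site as methylated, unmethylated, or intermediate using the 0.7 and 0.3 cutoffs discussed in this thread. The function name `call_site` and the example beta values are made up for illustration; this is not the COHCAP implementation.

```python
from collections import Counter


def call_site(beta: float, low: float = 0.3, high: float = 0.7) -> str:
    """Label a site using two cutoffs; values in [low, high] stay intermediate."""
    if beta > high:
        return "methylated"
    if beta < low:
        return "unmethylated"
    return "intermediate"


# Hypothetical beta values for six CpG sites
betas = [0.95, 0.05, 0.50, 0.82, 0.12, 0.30]
summary = Counter(call_site(b) for b in betas)
print(summary["methylated"], summary["unmethylated"], summary["intermediate"])  # 2 2 2
```

Counting the intermediate calls is a quick way to check whether the data are as bimodal as expected before committing to a purely binary encoding.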

Another option is to consider a delta beta cutoff (where I would recommend something like 0.2). However, this doesn't meet your original criteria.
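A delta-beta cutoff compares two conditions rather than calling a single sample, which is why it does not give a per-site binary state. A minimal sketch, assuming one beta value per group for each site (for example a group mean) and the 0.2 cutoff suggested above; the function name `delta_beta_call` and its labels are hypothetical:

```python
def delta_beta_call(beta_a: float, beta_b: float, min_delta: float = 0.2) -> str:
    """Compare group B against group A at one site using a delta-beta cutoff."""
    delta = beta_b - beta_a
    if delta >= min_delta:
        return "hyper"      # more methylated in group B
    if delta <= -min_delta:
        return "hypo"       # less methylated in group B
    return "unchanged"


print(delta_beta_call(0.10, 0.80))  # hyper
print(delta_beta_call(0.75, 0.40))  # hypo
print(delta_beta_call(0.50, 0.55))  # unchanged
```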

Note, though, that your signal distributions should look different for individual CpG sites versus CpG islands. The discussion above was primarily about CpG site characterization prior to CpG island analysis.

ADD REPLY • link 10.4 years ago by Charles Warden 8.2k