If I wanted to binarize fractional methylation data, given WGBS data for all chromosomes along with read coverage, how would I go about doing it? Should I set a hard cutoff (above 0.6 fractional methylation for "methylated" and below for "unmethylated"), or should the read coverage be taken into account somehow? Obviously, the binarization is not perfect, but I need it to run my computer simulation. Thanks in advance for the help.
If you use Bismark, it comes with tools for calculating percentage methylation from the alignments. I think there is a minimum cutoff parameter, but I don't think it actually matters, because I'm pretty sure the final result includes read counts for methylated and unmethylated nucleotides, so you can apply whatever coverage filter you like afterwards.
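To make the coverage idea concrete, here is a minimal Python sketch that works directly from per-site methylated and unmethylated read counts. The function name `binarize_site`, the minimum coverage of 10 reads, and the toy sites are assumptions made for illustration; the 0.6 cutoff is the one from the question. Sites with too few reads are left uncalled rather than forced to 0 or 1.

```python
from typing import Optional


def binarize_site(n_meth: int, n_unmeth: int,
                  threshold: float = 0.6,
                  min_coverage: int = 10) -> Optional[int]:
    """Return 1 (methylated), 0 (unmethylated), or None (too few reads)."""
    coverage = n_meth + n_unmeth
    if coverage < min_coverage:
        # Not enough reads to trust the fraction; leave the site uncalled.
        return None
    fraction = n_meth / coverage
    # Strictly above the threshold counts as methylated.
    return 1 if fraction > threshold else 0


# Toy data (made up): chromosome, position, methylated reads, unmethylated reads.
sites = [
    ("chr1", 10468, 9, 3),
    ("chr1", 10470, 1, 2),
    ("chr1", 10483, 2, 18),
]
calls = [(chrom, pos, binarize_site(m, u)) for chrom, pos, m, u in sites]
for row in calls:
    print(row)
```

How to treat the uncalled sites (drop them, impute from neighbours, or lower `min_coverage`) depends on what the simulation can tolerate.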
It is not ideal for whole-genome BS-Seq, but my COHCAP algorithm uses separate methylated and unmethylated thresholds like you described. So perhaps you can take a look at the COHCAP paper for analysis ideas (I think it is a good choice for targeted BS-Seq analysis).
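As a rough illustration of the two-threshold idea (this is not COHCAP's actual code, and the cutoffs of 0.3 and 0.7 are assumed values chosen only for the example), a site can be called methylated, unmethylated, or left as intermediate:

```python
def three_state_call(beta: float,
                     unmeth_cutoff: float = 0.3,
                     meth_cutoff: float = 0.7) -> str:
    """Classify a fractional methylation value using two cutoffs."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    if beta >= meth_cutoff:
        return "methylated"
    if beta <= unmeth_cutoff:
        return "unmethylated"
    # Values between the two cutoffs are ambiguous and left uncalled.
    return "intermediate"


betas = [0.05, 0.30, 0.55, 0.70, 0.98]
labels = [three_state_call(b) for b in betas]
print(list(zip(betas, labels)))
```

For a strictly binary simulation, the intermediate sites would still need a rule of their own, for example excluding them.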
Here is my take on it:
The short answer is: yes; you can define a hard threshold to binarize your methylation data, and as far as I know, the majority of methylation-related papers do this.
The long answer is: yes; you can define a threshold, but you should do this in a way that helps you explain your phenotype of interest, e.g. gene expression. In this sense, it is also important to know whether you want to work with probe-level or gene-level data.
Let's say you are working with probe-level data. Most of the time, people are interested in the effect of methylation on transcript levels, so for a given gene you need to identify which probe is most informative and which cutoff on the B-value (beta) best distinguishes the samples that show down-regulation of that gene from the normal ones. This threshold might be different for each gene (depending on the coverage, promoter sensitivity, CG content of that region, etc.). For example, you sometimes see a hyper-methylated promoter region (B ~ 1) for a gene that does not show any differential regulation at all. In these cases, would it make sense to threshold the methylation data and call these probes/genes methylated? It depends on what you want to accomplish with your binary data.
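One simple way to picture a per-gene cutoff is a scan over candidate beta thresholds, keeping the one that gives the largest drop in mean expression between low- and high-methylation samples. The sketch below is only that: the function name `best_cutoff`, the candidate grid, the minimum group size, and the toy values are all assumptions, and a real analysis would want a proper statistical test rather than a raw difference of means.

```python
from statistics import mean


def best_cutoff(betas, expression, candidates=None, min_group=2):
    """Return (cutoff, expression_gap) for one probe/gene, or None.

    expression_gap = mean expression of samples at or below the cutoff
    minus mean expression of samples above it.
    """
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    best = None
    for cutoff in candidates:
        high = [e for b, e in zip(betas, expression) if b > cutoff]
        low = [e for b, e in zip(betas, expression) if b <= cutoff]
        # Skip cutoffs that leave one group too small to compare.
        if len(high) < min_group or len(low) < min_group:
            continue
        gap = mean(low) - mean(high)
        # Strict comparison keeps the lowest cutoff when several tie.
        if best is None or gap > best[1]:
            best = (cutoff, gap)
    return best


# Toy data (made up): one probe measured in six samples.
betas = [0.05, 0.10, 0.45, 0.80, 0.90, 0.95]
expression = [9.1, 8.7, 8.9, 3.2, 2.8, 3.0]
result = best_cutoff(betas, expression)
print(result)
```

On this toy data the scan settles on a cutoff of 0.5. A gene whose expression does not change with methylation would give a gap near zero at every cutoff, which is the case described above where calling the probe "methylated" says little about regulation.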
I think whatever approach you use will be fine, as the field does not have a standard way of doing this; everybody seems to be going their own way nowadays. As long as you are aware of the artifacts your pipeline might introduce, I think the simple binary approach is probably the easiest way to go, but it is not necessarily the best in terms of explaining biological mechanisms and phenotypic effects.
Oh and you might find the following TCGA guideline useful: https://confluence.broadinstitute.org/display/GDAC/Methylation+Preprocessor