Question: Interpreting Fractional Methylation Data
qliu2011 wrote (5.4 years ago):

If I wanted to binarize fractional methylation data, given WGBS data for all the chromosomes along with read coverage, how would I go about it? Should I set a hard cutoff (above 0.6 fractional methylation for "methylated" and below for "unmethylated"), or should the read coverage be taken into account somehow? Obviously, the binarization is not perfect, but I need it to run my computer simulation. Thanks in advance for the help.

Charles Warden (Duarte, CA) wrote (5.4 years ago):

If you use Bismark, it comes with tools for calculating percentage methylation from the alignments. I think there is a minimum cutoff parameter, but I don't think it actually matters because I'm pretty sure the final result includes read counts for methylated and unmethylated nucleotides.
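As a rough illustration of how per-site read counts could feed a coverage-aware call, here is a minimal Python sketch. The function name `call_site`, the minimum coverage of 10 reads, and the example sites are assumptions made up for illustration; they are not Bismark defaults or real data. The 0.6 cutoff is the one from the question.

```python
def call_site(meth_reads, unmeth_reads, min_coverage=10, cutoff=0.6):
    """Binarize one site from its read counts.

    Returns 1 (methylated), 0 (unmethylated), or None when coverage is
    below min_coverage (an assumed threshold, chosen only for this example).
    """
    coverage = meth_reads + unmeth_reads
    if coverage < min_coverage:
        return None  # too few reads to trust the fraction
    fraction = meth_reads / coverage
    return 1 if fraction > cutoff else 0


# Hypothetical sites: (chromosome, position, methylated reads, unmethylated reads)
sites = [
    ("chr1", 10468, 9, 1),
    ("chr1", 10470, 2, 18),
    ("chr1", 10483, 2, 1),
]

calls = [(chrom, pos, call_site(m, u)) for chrom, pos, m, u in sites]
```

Sites returning `None` can be dropped or treated as missing in the simulation, so a 2-out-of-3 site is not given the same weight as a 20-out-of-30 site.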

It is not ideal for whole-genome BS-Seq, but my COHCAP algorithm uses methylated and unmethylated thresholds like you described. So, perhaps you can take a look at the paper for analysis ideas (since I think it is a good choice for targeted BS-Seq analysis):


Hi, so your paper quotes: "The CpG site analysis is based on the method described in Sproul et al. (44), where sites are defined as methylated if they show a percentage of methylation (beta) greater than a certain value (0.7 for cell line data, 0.3 for patient data) and sites are unmethylated if they have beta values <0.3 (by default)...We extended this algorithm to include a P-value and false-discovery rate [FDR, using the method of Benjamini and Hochberg (45)] value as cutoffs for differential expression. The method of P-value calculation varies based on the number of groups considered for the analysis (one group, two groups, three or more groups; Supplementary Table S2, Supplemental Methods)."

It appears that you are using a hard cutoff of 0.7 for considering a site as methylated. But then you state that if a site has a fractional methylation value of <0.3, it is not methylated. What happens to the values in between 0.3 and 0.7? Thanks for the help!

Reply written 5.4 years ago by qliu2011

When working with fairly clean cell line data, there are not many sites with beta values between 0.3 and 0.7. Thus, you could call the intermediate values either ambiguous or heterozygous (one methylated and one unmethylated allele). For the Illumina array data, I occasionally saw an intermediate "heterozygous" peak, but I usually only saw clear peaks > 0.7 and < 0.3. For BS-Seq, the distribution is different (but I think the bimodal peaks I saw were even sharper, making the intermediate methylation values less of an issue).
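A minimal Python sketch of this three-way call, using the 0.7 and 0.3 thresholds discussed above. The function name `classify_beta` and the label strings are chosen for illustration only and are not taken from COHCAP.

```python
def classify_beta(beta, upper=0.7, lower=0.3):
    """Label a beta value as methylated, unmethylated, or intermediate."""
    if beta > upper:
        return "methylated"
    if beta < lower:
        return "unmethylated"
    # Values from lower to upper (inclusive) are left uncalled.
    return "intermediate"


# Made-up beta values for illustration
labels = [classify_beta(b) for b in (0.85, 0.10, 0.50)]
```

If a strictly binary input is needed, the intermediate sites can be excluded rather than forced into one class.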

Another option is to consider a delta beta cutoff (where I would recommend something like 0.2). However, this compares groups of samples rather than binarizing each site, so it doesn't meet your original criteria.
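For reference, a delta beta filter can be sketched in a few lines of Python. This assumes delta beta means the difference between the mean beta values of two sample groups at one site; the function names and the example values are hypothetical.

```python
from statistics import mean


def delta_beta(group_a, group_b):
    """Difference in mean beta between two groups of samples at one site."""
    return mean(group_a) - mean(group_b)


def is_differential(group_a, group_b, min_delta=0.2):
    """True when the absolute delta beta reaches the cutoff (0.2 as suggested above)."""
    return abs(delta_beta(group_a, group_b)) >= min_delta


# Made-up beta values for one site in two groups
treated = [0.8, 0.9, 0.7]
control = [0.2, 0.3, 0.1]
flag = is_differential(treated, control)
```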

Note, though, that your signal distributions should look different for individual CpG sites versus CpG islands. The discussion above was primarily about CpG site characterization prior to CpG island analysis.

Reply written 5.4 years ago (modified 5.4 years ago) by Charles Warden

B. Arman Aksoy (New York, NY) wrote (5.4 years ago):

Here is my take on it:

The short answer is: yes, you can define a hard threshold to binarize your methylation data, and as far as I know, the majority of methylation-related papers do this.

The long answer is: yes, you can define a threshold, but you should do this in a way that helps you explain your phenotype of interest, e.g. gene expression. In this sense, it is also important to know whether you want to work with probe-level or gene-level data.

Let's say you are working with probe-level data. Most of the time, people are interested in the effect of methylation on transcript levels. This requires you to identify which probe is most informative for a given gene, and which cut-off for the B-value (beta) best distinguishes the samples (from the normal ones) that show down-regulation of that gene. This threshold might be different for each gene (depending on the coverage, promoter sensitivity, CG content of that region, etc.). For example, you sometimes see hyper-methylated promoter regions (B ~ 1) for a gene that does not really show differential regulation at all. In these cases, would it make sense to threshold the methylation data and call these probes/genes methylated? It depends on what you want to accomplish with your binary data.

I think whatever approach you use will be fine, as the field does not have a standard way to do things; everybody seems to be going their own way nowadays. As long as you are aware of the artifacts you might have in your pipeline, I think the simple binary approach might be the easiest way to go, but it is not necessarily the best in terms of explaining biological mechanisms and phenotypic effects.

Oh and you might find the following TCGA guideline useful:


I would like to distinguish the effect of methylation on gene expression of each specific gene. So, yes, probe-level data binarization might be the best way to go. (Binarize the methylation data for each gene differently.) If I were to do this, what ways might be best for picking the threshold for each gene?

Reply written 5.4 years ago by qliu2011

In that case, I think you don't need to binarize the methylation data at all. You can simply correlate the B value of each probe with the expression level of the gene of interest. As described in the TCGA guideline above, you should take the most anti-correlated probe. Once you have done this for every gene (expression versus the corresponding methylation probes), you can decide on the effect of methylation by looking at these correlation values.
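A minimal Python sketch of picking the most anti-correlated probe for one gene. It assumes Pearson correlation (the guideline may use a different measure), and the probe names and values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def most_anticorrelated_probe(probe_betas, expression):
    """Return (probe, r) for the probe whose beta values correlate most negatively."""
    scores = {probe: pearson(betas, expression) for probe, betas in probe_betas.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical data: one gene's expression across five samples,
# and beta values for two probes mapped to that gene.
expression = [8.0, 6.5, 5.0, 3.0, 2.0]
probe_betas = {
    "probe_A": [0.10, 0.30, 0.50, 0.70, 0.90],
    "probe_B": [0.50, 0.40, 0.60, 0.50, 0.55],
}

best_probe, r = most_anticorrelated_probe(probe_betas, expression)
```

Repeating this per gene gives one correlation value per gene, which can then be ranked instead of thresholding the beta values.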

Reply written 5.4 years ago by B. Arman Aksoy