Question

Quantifying DNA methylation from Bisulfite-Seq Data

10.9 years ago

Ali ▴ 140

Sodium bisulfite treatment is the gold standard for measuring the level of DNA methylation. It converts unmethylated cytosines to uracils, which are read as thymines after PCR, but leaves methylated cytosines unchanged.

Here is my question. Suppose we have the following DNA fragment (methylated cytosines are in upper case, C, and unmethylated cytosines are in lower case, c):

5' ACGATGc 3' (Top strand)                          3' TGCTAcG 5' (Bottom strand)

After bisulfite treatment we will have:

5' ACGATGT 3' (Top strand)                         3' TGCTATG 5' (Bottom strand)
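To make the conversion rule concrete, here is a minimal Python sketch. It assumes the same notation as above (upper-case C is methylated, lower-case c is unmethylated) and skips the uracil intermediate by writing T directly. The function name `bisulfite_convert` is only an illustrative choice.

```python
def bisulfite_convert(strand: str) -> str:
    """Turn unmethylated cytosines ('c') into 'T'; leave every other base as is."""
    return "".join("T" if base == "c" else base for base in strand)


top = "ACGATGc"      # written 5' -> 3'
bottom = "TGCTAcG"   # written 3' -> 5'

converted_top = bisulfite_convert(top)
converted_bottom = bisulfite_convert(bottom)

print(converted_top)     # ACGATGT
print(converted_bottom)  # TGCTATG
```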

After PCR, each of the top and bottom strands becomes a double-stranded DNA molecule, as shown below. Strands 1 and 2 are complementary and are derived from the top strand above; strands 3 and 4 are derived from the bottom strand.

5' ACGATGT 3' (1, Top strand, forward)           3' TGCTATG 5' (3, Bottom strand, forward)
3' TGCTACA 5' (2, Top strand, reverse)            5' ACGATAC 3' (4, Bottom strand, reverse)
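The four strands can be reproduced with a small Python sketch. It assumes that PCR simply synthesizes the base-by-base complement of each converted strand. The names `complement` and `strands` are hypothetical and used only for illustration.

```python
# Translation table for base-by-base complementing (no reversal).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return strand.translate(COMPLEMENT)


converted_top = "ACGATGT"     # written 5' -> 3'
converted_bottom = "TGCTATG"  # written 3' -> 5'

strands = {
    1: converted_top,                # top strand, forward (5' -> 3')
    2: complement(converted_top),    # top strand, reverse (3' -> 5')
    3: converted_bottom,             # bottom strand, forward (3' -> 5')
    4: complement(converted_bottom), # bottom strand, reverse (5' -> 3')
}

for number, sequence in strands.items():
    print(number, sequence)
# 1 ACGATGT
# 2 TGCTACA
# 3 TGCTATG
# 4 ACGATAC
```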

Now we have 4 different strands that align to the same genomic location (on either the forward or the reverse strand). The problem is that each of them gives a different measurement of DNA methylation. For instance, in sequence (1), ACGATGT, the last base is T, meaning an unmethylated cytosine, which is correct. However, in strand (4), ACGATAC, the last base is C, which suggests a methylated cytosine, and that is wrong.

How can I infer the correct methylation status of each base from the 4 different reads?

dna-methylation epigenetics bisulfite bis-seq

updated 3.6 years ago by Ram 45k • written 10.9 years ago by Ali ▴ 140

score 2 · Answer 1 · 2014-09-03

This is the reason that those of us who have written BS-seq aligners have more grey hair than we should.

The answer to this comes from the orientation of a read after alignment and from the conversions applied to the read and to the genome to get it to align. In short, if you in silico convert a read C->T (I'll just use single-end examples) and it aligns with a forward orientation to a C->T converted genome, then it originated from the original top strand. If it aligns with a reverse orientation to the G->A converted genome, then it came from the original bottom strand. The other two strands (we generally call these "complementary to the original top" and "complementary to the original bottom") follow similarly. I recall that Felix Krueger has a nice illustration of all of this in the RRBS guide for Bismark (it's the same for WGBS and RRBS; have a look starting at page 6, I think).
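As a rough illustration of that logic, here is a toy Python sketch. It uses an exact substring search in place of a real aligner, covers only the two original strands, and is not how Bismark or any other tool is actually implemented. The names `assign_strand` and `reverse_complement` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a 5' -> 3' sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def assign_strand(read: str, genome: str):
    """Guess the strand of origin of a single-end read (toy exact matching).

    `read` is the sequenced read, 5' -> 3'. `genome` is the unconverted
    forward (top) strand, 5' -> 3'.
    """
    ct_read = read.replace("C", "T")  # in silico C->T conversion of the read
    if ct_read in genome.replace("C", "T"):
        # Forward orientation against the C->T converted genome.
        return "original top"
    if reverse_complement(ct_read) in genome.replace("G", "A"):
        # Reverse orientation against the G->A converted genome.
        return "original bottom"
    return None  # no match under these two scenarios


genome = "ACGATGC"
print(assign_strand("ACGATGT", genome))  # original top
print(assign_strand("GTATCGT", genome))  # original bottom
```

Here `GTATCGT` is the bisulfite-converted bottom strand from the question, written 5' -> 3'. On a sequence this short the two scenarios could both match by chance; a real aligner has to score and resolve that ambiguity.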

Once you know from which strand a read arose, you can then determine what residue it's giving information for.
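A minimal sketch of that last step, in Python. It assumes the read has already been placed on forward-strand coordinates, as aligners report it, and that its strand of origin is known. The function name `call_methylation` and the labels "top" and "bottom" are made up for this example. Reads from the top strand (or its complement) are informative only at reference C positions; reads from the bottom strand (or its complement) are informative only at reference G positions, where the cytosine sits on the other strand.

```python
def call_methylation(reference: str, read: str, origin: str) -> dict:
    """Return {0-based position: call} for the cytosines this read informs.

    `reference` and `read` are both written on the forward strand, 5' -> 3',
    and are assumed to be aligned without gaps.
    """
    if origin == "top":
        ref_base, methylated, unmethylated = "C", "C", "T"
    elif origin == "bottom":
        ref_base, methylated, unmethylated = "G", "G", "A"
    else:
        raise ValueError("origin must be 'top' or 'bottom'")

    calls = {}
    for pos, (ref, base) in enumerate(zip(reference, read)):
        if ref != ref_base:
            continue  # this position carries no methylation information here
        if base == methylated:
            calls[pos] = "methylated"
        elif base == unmethylated:
            calls[pos] = "unmethylated"
        # Any other base is a mismatch or sequencing error and is skipped.
    return calls


reference = "ACGATGC"  # unconverted top strand from the question

print(call_methylation(reference, "ACGATGT", "top"))
# {1: 'methylated', 6: 'unmethylated'}
print(call_methylation(reference, "ACGATAC", "bottom"))
# {2: 'methylated', 5: 'unmethylated'}
```

With this rule, the final C in strand (4), ACGATAC, is never interpreted as a methylated cytosine: that read derives from the bottom strand, so only its G and A bases at reference G positions are examined.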