It sounds like you need to familiarize yourself a bit more with the data and data formats for RRBS.
The QC protocol and a paper by the Bismark authors will help you understand the biases you should be looking out for. The Bismark website also offers a couple of sample HTML output files against which you can compare your results.
To understand what CHG etc. stand for (in these context labels, H means A, C or T), you'll find plenty of information in the Bismark publication and in the Bismark user guide, which includes, for example, the following passage:
Bisulfite treatment of DNA and subsequent PCR amplification can give rise to four (bisulfite converted) strands for a given locus. Depending on the adapters used, BS-Seq libraries can be constructed in two different ways:
- If a library is directional, only reads which are (bisulfite converted) versions of the original top strand (OT) or the original bottom strand (OB) will be sequenced. Even though the strands complementary to OT (CTOT) and OB (CTOB) are generated in the BS-PCR step they will not be sequenced as they carry the wrong kind of adapter at their 5’-end. By default, Bismark performs only 2 read alignments to the OT and OB strands, thereby ignoring alignments coming from the complementary strands as they should theoretically not be present in the BS-Seq library in question.
- Alternatively, BS-Seq libraries can be constructed so that all four different strands generated in the BS-PCR can and will end up in the sequencing library with roughly the same likelihood. In this case all four strands (OT, CTOT, OB, CTOB) can produce valid alignments and the library is called non-directional. Specifying --non_directional instructs Bismark to use all four alignment outputs.
To summarise again: alignments to the original top strand or to the strand complementary to the original top strand (OT and CTOT) will both yield methylation information for cytosines on the top strand. Alignments to the original bottom strand or to the strand complementary to the original bottom strand (OB and CTOB) will both yield methylation information for cytosines on the bottom strand, i.e. they will appear to yield methylation information for G positions on the top strand of the reference genome.
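If it helps to see the four strands concretely, here is a rough Python sketch. It is not Bismark's code: the function names are made up for illustration, and it assumes every cytosine is unmethylated so that every C reads as T after conversion. It also shows how a G on the top strand of the reference corresponds to a C on the bottom strand, and how the CpG/CHG/CHH context is read in the 5'-to-3' direction of whichever strand carries the C.

```python
# Illustrative sketch only: all names here are hypothetical, not Bismark's API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Which reference strand each alignment type yields methylation information for.
REPORTS_ON = {"OT": "+", "CTOT": "+", "OB": "-", "CTOB": "-"}


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(seq):
    """Assumption: fully unmethylated DNA, so every C is read as T."""
    return seq.replace("C", "T")


def four_strands(top):
    """The four (bisulfite converted) strands for one locus, each written 5'->3'."""
    ot = bisulfite_convert(top)           # original top strand
    ob = bisulfite_convert(revcomp(top))  # original bottom strand
    return {"OT": ot, "CTOT": revcomp(ot), "OB": ob, "CTOB": revcomp(ob)}


def cytosine_context(top, pos):
    """Return (strand, context) for 0-based position `pos` on the top strand.

    A C on the top strand is a '+' strand cytosine; a G on the top strand
    marks a cytosine on the bottom ('-') strand. Returns None for A or T.
    """
    base = top[pos]
    if base == "C":
        strand, seq, i = "+", top, pos
    elif base == "G":
        strand, seq, i = "-", revcomp(top), len(top) - 1 - pos
    else:
        return None
    downstream = seq[i + 1:i + 3]  # up to two bases 3' of the cytosine
    if downstream[:1] == "G":
        context = "CpG"
    elif len(downstream) < 2:
        context = "unknown"  # too close to the end of the sequence to tell
    elif downstream[1] == "G":
        context = "CHG"
    else:
        context = "CHH"
    return strand, context


if __name__ == "__main__":
    top = "ACGTCCGA"
    for name, seq in four_strands(top).items():
        print(name, seq, "reports on strand", REPORTS_ON[name])
    print(cytosine_context(top, 4))
```

For the toy sequence `ACGTCCGA`, OT is the top strand with every C turned into T, while CTOB reads like the top strand with every G turned into A. That is why OB/CTOB alignments appear to give methylation information for G positions on the top strand.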
Once you're confident you understand the output of Bismark and the quality of your data, you may want to check out packages for downstream analyses, such as methylKit. Good overviews of the pitfalls and tools for methylation analysis can be found here, here, and here.