Question

Difference in Bismark output methylation call files and coverage files

1

Entering edit mode

3.8 years ago

linelr ▴ 40

Hi biostars!

I am working with RRBS data and have used Bismark for methylation calls.

I have compared the two output files with cytosines in a CpG-context and the coverage files (which only consider cytosines in a CpG context), but I can´t get my head around whether they should have the similar number of cytosines. For my data, there are 10 times more cytosines in the CpG-context files (giving me info if the cytosine is methylated or not) than in the coverage files. I guess the coverage files do not report cytosines that are not covered by any reads, but I still think it is a little weird that the difference between the number of cytosine is that great. This is probably due to some lack of understanding on my part, so I really do appreciate any attempt at explaining what is going on here for me

Thanks in advance

Best, Line

RRBS Bismark DNA-methylation • 3.7k views

ADD COMMENT • link updated 2.6 years ago by Papyrus ★ 2.9k • written 3.8 years ago by linelr ▴ 40

score 1 · Answer 1 · 2020-06-23

1

Entering edit mode

3.8 years ago

Papyrus ★ 2.9k

It seems you are comparing the "genome-wide cytosine report" files and the "coverage" files outputted by Bismark (you can probably recognize them because the cytosine report files usually contain a "CpG_report" string and the coverage files usually contain "cov" in their names).

By "number" of cytosines I suppose you're counting the lines in each file. It is as you say, the cytosine report contains the coordinates for all of the genomic cytosines (moreover, it provides strand-specific information so a typical human RRBS should have around 58 million lines). This is not the case for the coverage file so your differences are not unexpected. For example, I'm currently handling RRBS data for which I can have ~5-7 M lines in the coverage file and 58 M in the cytosine report for a certain sample. In fact, all of the cytosine report files have the same number of lines!

Keep in mind that RRBS in a targeted sequencing which looks at a fraction of the total genomic CpGs, so do not expect to have coverage at the majority of genomic CpGs. I usually see RRBS data with 2-4 million CpGs detected with >10 coverage when sequencing 30-50M reads per sample.

If you do a quick check removing the lines with 0 counts for cytosines and thymines (say using awk '($4 != 0 || $5 != 0) {print $0}') you will see that you lose a lot of lines in the cytosine report.

ADD COMMENT • link 3.8 years ago by Papyrus ★ 2.9k

0

Entering edit mode

Hi Papyrus, just wanted to say thank you for making this clear for me! Sorry I was late in responding!

ADD REPLY • link 3.7 years ago by linelr ▴ 40

0

Entering edit mode

no worries, glad to help!

ADD REPLY • link 3.7 years ago by Papyrus ★ 2.9k

0

Entering edit mode

Hello @Papyrus. Your observations regarding the number of sites in each file clears my doubts about why I see a difference.

I was using methylkit to unite and retrieve the percent methylation values and observed that if I don't use min.per.group in unite() only the sites that are covered in all samples are reported and this is just in hundreds. Is this normal?. On setting it to 1L it increases to ~2m as you mentioned. The sites that are not present in some samples are marked NA and I suppose it is safe to convert them to 0

ADD REPLY • link 2.6 years ago by Arindam Ghosh ▴ 510

1

Entering edit mode

Hi @Arindam, if you say that you have around 2M CpGs covered per sample, having only 100s of CpGs common across all samples seems to be a pretty low number. Do you have lots of samples, from different batches, etc.? The more samples you have, the fewer common CpGs you will find between them. Or maybe you have 1 or 2 outliers with almost no CpGs which are influencing all the rest of the samples?

Regarding inputting NA values for 0 values, IMHO, I would not do that for methylation data. There are other omics data where the scale is [0, infinite), and missing values may be related to low quantities which were undetected (e.g. proteomics, RNA-seq), because the number of reads relates to the signal. However, methylation data carries information across all the [0,1] scale, and the number of reads does not relate to the quantity of methylation (the signal is the proportion of C/(C+T)), so regions with no reads (NA, undetected) could have 0 or 100% methylation, you really cannot know. I would use an imputation strategy (e.g. nearest neighbour, or sampling from the rest of methylation values of the other samples, etc.) or ignore those CpGs for all samples. (I usually ignore them, but it is true that you have lots of missing data, so you may have to implement some imputation strategy)

ADD REPLY • link 2.6 years ago by Papyrus ★ 2.9k