Hi biostars!
I am working with RRBS data and have used Bismark for methylation calls.
I have compared the two output files with cytosines in a CpG-context and the coverage files (which only consider cytosines in a CpG context), but I canĀ“t get my head around whether they should have the similar number of cytosines. For my data, there are 10 times more cytosines in the CpG-context files (giving me info if the cytosine is methylated or not) than in the coverage files. I guess the coverage files do not report cytosines that are not covered by any reads, but I still think it is a little weird that the difference between the number of cytosine is that great. This is probably due to some lack of understanding on my part, so I really do appreciate any attempt at explaining what is going on here for me
Thanks in advance
Best, Line
Hi Papyrus, just wanted to say thank you for making this clear for me! Sorry I was late in responding!
no worries, glad to help!
Hello @Papyrus. Your observations regarding the number of sites in each file clears my doubts about why I see a difference.
I was using methylkit to unite and retrieve the percent methylation values and observed that if I don't use
min.per.group
inunite()
only the sites that are covered in all samples are reported and this is just in hundreds. Is this normal?. On setting it to 1L it increases to ~2m as you mentioned. The sites that are not present in some samples are marked NA and I suppose it is safe to convert them to 0Hi @Arindam, if you say that you have around 2M CpGs covered per sample, having only 100s of CpGs common across all samples seems to be a pretty low number. Do you have lots of samples, from different batches, etc.? The more samples you have, the fewer common CpGs you will find between them. Or maybe you have 1 or 2 outliers with almost no CpGs which are influencing all the rest of the samples?
Regarding inputting NA values for 0 values, IMHO, I would not do that for methylation data. There are other omics data where the scale is [0, infinite), and missing values may be related to low quantities which were undetected (e.g. proteomics, RNA-seq), because the number of reads relates to the signal. However, methylation data carries information across all the [0,1] scale, and the number of reads does not relate to the quantity of methylation (the signal is the proportion of C/(C+T)), so regions with no reads (NA, undetected) could have 0 or 100% methylation, you really cannot know. I would use an imputation strategy (e.g. nearest neighbour, or sampling from the rest of methylation values of the other samples, etc.) or ignore those CpGs for all samples. (I usually ignore them, but it is true that you have lots of missing data, so you may have to implement some imputation strategy)