Question: Efficiently storing methylation data from BS-sequencing
gravatar for Chris Miller
4.7 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

How are people efficiently storing methylation data derived from bisulfite sequencing?

The results spit out from using BSmap's associated tools include a value for each cytosine in the genome, in something like the below format.  There's clearly some redundancy here, which makes it highly compressable, but we're still looking at gigabytes of files in yet another poorly-documented format.  

Has the community coalesced around a format that has decent toolchains attached? The sequence context, in particular, seems useful, but not easily slammed into a bed file. VCF, as always, seems flexible enough to handle this, but might be overkill. Is there something else (maybe even binary) that I'm not finding with a search? Thoughts appreciated.


Example output here:

chr  pos       strand  context  ratio  eff_CT_count  C_count  CT_count  rev_G_count  rev_GA_count  CI_lower  CI_upper
22   16050004  +       ATCTG    0.000  1.00          0        1         0            0             0.000     0.793
22   16050013  +       GTCCC    0.000  3.00          0        3         0            0             0.000     0.562
22   16050014  +       TCCCA    0.000  3.00          0        3         0            0             0.000     0.562


Update: So far the consensus on twitter seems to be that there's been no standard (or even well-defined) format. Would love to hear if anyone knows differently

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Chris Miller21k

Twitter's correct, there's no real standard. I hear someone is working on a VCF variant, but who knows when that'll be finished (not to mention that that will end up being the N+1th format for this type of data). Frankly, I think most of us use modified bedGraph files, like those produced by bismark, bison and PileOMeth...but those often don't include context.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Devon Ryan94k

is there any (large)  file to test somewhere ?

ADD REPLYlink written 4.7 years ago by Pierre Lindenbaum126k

Im a newcomer to BS-Seq - why is storing the context is important?
Also, why should we store C->G and  CT->GA events separately?

Why not just have chr, pos, #C, #G ?

ADD REPLYlink written 4.7 years ago by John12k

I'm not a methylation expert either, but my understanding is that whether context is important depends completely on the types of questions you're asking. The immediately adjacent base is certainly important (CpG/CHH/CHG contexts) and farther flanking bases may be important clues about mechanisms of methylation, drug binding, etc

ADD REPLYlink written 4.7 years ago by Chris Miller21k

yeah for analysis you'd want extra fields :)
...still, for storage of the data in its most compact form, i'd have thought chr/pos/C/G is all you need. Perhaps you could even go to just chr/pos/CpG/GpC to throw away all the C's that aren't part of a CpG...
I would be interested to know how many C's not part of a CpG are not converted, and if that's due to real methylation of non-CpG Cs or incomplete bisulphite conversion.

Still, anything beyond that is redundant id have thought. Ratios, strand, etc, can all be calculated on-the-fly quicker than reading the next row off the file, so why make your file twice as big :D

ADD REPLYlink written 4.7 years ago by John12k

The non-CpG methylation rate is species-dependent. For most mammals, you'd expect low single-digit percentages, though. I hear that this is different in plants, but I've never worked on them. The bedGraph-variant format that I mentioned above is essentially what you mentioned. Typically, you only store a single context in a given file. This still has an extra column, though, since often people want to collapse metrics into CpG level rather than single-C (to get higher power, since methylation tends to be symmetric).

ADD REPLYlink written 4.7 years ago by Devon Ryan94k

Thanks Devon :)
So that might suggest that methylated non-CpG Cytosines are informative..?
I wonder if they cluster at certain genomic regions, or if they even have the same function as methylated CpGs. I need to research that a bit more because my understanding is a pretty poor.
About the symmetry though, that's a very good point. For a CpG you'd probably want a format that measures the methylation ratio of the forward and reverse strand on the same row, like:
chr | pos | C_F | C_R | G_F | G_R 
where _F is the methylation (or lack of) at pos, and _R is the methylation (or lack of) at pos+1. The CpG is a single 'thing', and so I guess it makes more sense to make that a single row than two. I originally thought this is what was going on in the OP"s example with the "rev" columns (standing for reverse strand), but then I realised it can't be that since row 2/3 are talking about the same CpG just at pos and pos+1, so rev must stand for reverted or something.

ADD REPLYlink written 4.7 years ago by John12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1517 users visited in the last hour