Hello, does anybody have suggestions for how to simulate RRBS reads with a known methylation level? I have a reference for a non-model species and want to make sure reads map appropriately before beginning data generation.
Anything helps, thank you!
Hello, does anybody have suggestions for how to simulate RRBS reads with a known methylation level? I have a reference for a non-model species and want to make sure reads map appropriately before beginning data generation.
Anything helps, thank you!
There is a bisulfite sequence simulator that may help. What you also need is a BED file of regions that are well covered in the genome with RRBS. Then use bedtools getFasta to extract the sequence of all RRBS covered regions. Then use Sherman to generate simulated bisulfite seq data for these regions.
Ideally RRBS should select genome fragments starting and ending with CCGG and of size between, say, 50 and 1000bp, right? So you could first extract such fragments and produce a reference genome containing only these fragments. Then you could run Sherman as suggested by mark.ziemann on this restricted reference. By the way, if you just want to know how many theoretical fragments you can obtain from RRBS and whether these are mappable you don't need to assume a particular methylation level since BS-Seq mappers bioinformatically "demethylate" all the reads before mapping.
I wrote a simple program to extract sequences from a fasta file matching a regular expression (fastaRegexFinder.py). To obtain RRBS fragments from a reference you could use something like:
fastaRegexFinder.py -q -r 'CCGG(.(?<!CCGG)){50,1000}(?=CCGG)' -f genome.fa --noreverse
The extracted sequences are in the last column of output, you should append CCGG to these. Example:
cat genome.fa
>seq1
zzCCGGaaaCCGGbbbbCCGGxxxxxxCCGGdddCCGGeeeeCCGGCCGGfffCCGGhhhhCCGGiiiiiCCGGlllCCGGmmmmCCGGzz
>seq2
zzCCGGaaaaaaaCCGGbbbbbbbCCGGzzzzzzzzzzzzzzzzzzzzzzzzzzzzzCCGGaaa
>seq3
zzCCGGaaaaaaaCCGGCCGGbbbbbbbCCGGzzz
fastaRegexFinder.py -q -r 'CCGG(.(?<!CCGG)){5,10}(?=CCGG)' -f genome.fa --noreverse
seq1 17 27 seq1_17_27_for 10 + CCGGxxxxxx
seq1 61 70 seq1_61_70_for 9 + CCGGiiiii
seq2 2 13 seq2_2_13_for 11 + CCGGaaaaaaa
seq2 13 24 seq2_13_24_for 11 + CCGGbbbbbbb
seq3 2 13 seq3_2_13_for 11 + CCGGaaaaaaa
seq3 17 28 seq3_17_28_for 11 + CCGGbbbbbbb
Did I get it right? does it make sense?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
theoretically, for directional RRBS library, reads start with CGG or TGG. why in a real experiment we see reads start from 3 different bases (although very few of them)? Thanks.
Star activity (MspI or whatever else is used cutting outside its normal site) or the general degradation caused by bisulfite treatment are the likely culprits in my mind.