I am trying to merge replicates from a bisulfite sequencing experiment. I am working with the output files of the MOABS programme, which gives positions for CpGs and if they were methylated or not. It can be thought of as a bed file with extra columns - some columns have counts I want to sum and keep track of, some I want to be able to specify as "+", "-", or "B" . Would anyone know of a programme that can do this? Or a script somewhere?
I actually wrote a programme in R that works, but with my relatively basic coding it's just too slow for the huge files I have, so I'm looking for a new direction. I've tried bedtools merge too but it doesn't give me the flexibility I need. Same for bedops possibly?
The headers for my data are as follows:
chrom start end ratio totalC methC strand next Plus totalC methC Minus totalC methC
This is a sample of my data:
chr1 3000573 3000575 1 2 2 + G + 2 2 - 0 0
chr1 3000573 3000575 1 5 5 B G + 3 3 - 2 2
chr1 3000573 3000575 1 5 5 B G + 4 4 - 1 1
chr1 3000725 3000727 1 1 1 - G + 0 0 - 1 1
chr1 3000725 3000727 1 1 1 - G + 0 0 - 1 1
And I'm looking to end up with this:
chr1 3000573 3000575 1 12 12 B G + 9 9 - 3 3
chr1 3000725 3000727 1 2 2 - G + 0 0 - 2 2
Any suggestions much appreciated. Thanks in advance!
I would use
awk
to solve such problem (e.g.,group
). Can you please explain what you want to do with column 7 (+/-/B) as this is not clear to me.Column 7 is the awkward one. If both columns 10 and 13 are > 0 then column 7 gets a "B" (meaning counts come from both + and - strands). If only column 10 is > 0 then it's a "+", and only column 13 is > 0 then it's "-".
Could you please say a bit more on which
awk
commands might work? I'm not too familiar with it.I thought I'd written a little python program to do this for MOABS files, but can't seem to find it at the moment. Anyway, this is pretty simple to do in any scripting languages (you could even use awk, but it'd get a bit unwieldy).