Question

How to merge replicates in bed file, but keep counts of additional columns?

0

Entering edit mode

10.3 years ago

YanO ▴ 140

I am trying to merge replicates from a bisulfite sequencing experiment. I am working with the output files of the MOABS programme, which gives positions for CpGs and if they were methylated or not. It can be thought of as a bed file with extra columns - some columns have counts I want to sum and keep track of, some I want to be able to specify as "+", "-", or "B" . Would anyone know of a programme that can do this? Or a script somewhere?

I actually wrote a programme in R that works, but with my relatively basic coding it's just too slow for the huge files I have, so I'm looking for a new direction. I've tried bedtools merge too but it doesn't give me the flexibility I need. Same for bedops possibly?

The headers for my data are as follows:

chrom    start    end    ratio    totalC    methC    strand    next    Plus    totalC    methC    Minus    totalC    methC

This is a sample of my data:

chr1    3000573    3000575    1    2    2    +    G    +    2    2    -    0    0
chr1    3000573    3000575    1    5    5    B    G    +    3    3    -    2    2
chr1    3000573    3000575    1    5    5    B    G    +    4    4    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1

And I'm looking to end up with this:

chr1    3000573    3000575    1    12    12    B    G    +    9    9    -    3   3
chr1    3000725    3000727    1    2     2     -    G    +    0    0    -    2   2

Any suggestions much appreciated. Thanks in advance!

merge sequencing bed • 4.9k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by YanO ▴ 140

0

Entering edit mode

I would use awk to solve such problem (e.g., group). Can you please explain what you want to do with column 7 (+/-/B) as this is not clear to me.

ADD REPLY • link 10.3 years ago by PoGibas 5.1k

0

Entering edit mode

Column 7 is the awkward one. If both columns 10 and 13 are > 0 then column 7 gets a "B" (meaning counts come from both + and - strands). If only column 10 is > 0 then it's a "+", and only column 13 is > 0 then it's "-".

Could you please say a bit more on which awk commands might work? I'm not too familiar with it.

ADD REPLY • link updated 4.7 years ago by Ram 44k • written 10.3 years ago by YanO ▴ 140

0

Entering edit mode

I thought I'd written a little python program to do this for MOABS files, but can't seem to find it at the moment. Anyway, this is pretty simple to do in any scripting languages (you could even use awk, but it'd get a bit unwieldy).

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

Ram · Answer 1 · 2014-06-10

Using bedtools groupby it should be quite simple, although I don't completely understand the final desired output...

Input example:

cat test.bed
chr1    3000573    3000575    1    2    2    +    G    +    2    2    -    0    0
chr1    3000573    3000575    1    5    5    B    G    +    3    3    -    2    2
chr1    3000573    3000575    1    5    5    B    G    +    4    4    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1

Assuming input is sorted by chromosome and position and it has no header:

groupBy -g 1,2,3 \
    -c 5,6,7,8,9,10,11,12,13,14 \
    -o sum,sum,distinct,distinct,distinct,sum,sum,distinct,sum,sum \
    -i test.bed \
| awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $4/$5, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}'

You get:

chr1    3000573    3000575    1    12    12    +,B    G    +    9    9    -    3    3
chr1    3000725    3000727    1    2    2    -    G    +    0    0    -    2    2

The awk command is just to restore the "ratio" column to be col 5 / col 6.

As it is not clear to me what to do with with columns 7, 8, 9 and 12 I put the option "distinct", which gives a comma separated list of the unique elements found.

score 0 · Answer 2 · 2014-06-10

Strip the header line and use BEDOPS sort-bed to sort your BED file:

$ tail -n +2 replicates.unsorted.bed | sort-bed - > replicates.bed

Sorting allows you to compare one line to the next and decide how to handle state variables. If your data were not sorted, then you would not be able to process the BED file as a stream.

So, as mentioned, use awk to process your sorted BED file in serial:

$ awk ' \
    BEGIN { \
        FS = "\t"; \
        previousChr = "chrFoo"; \
        previousStart = 0; \
        previousStop = 0; \
        resetFlag = 0; \
    } \
    { \
        split($0, elements, "\t"); \
        currentChr = elements[1]; \
        currentStart = elements[2]; \
        currentStop = elements[3]; \
        if (NR = 1) { \
            currentTotalCSum = elements[5]; \
            currentMethCSum = elements[6]; \
            currentPlusTotalCSum = elements[10]; \
            currentPlusMethCSum = elements[11]; \
            currentMinusTotalCSum = elements[13]; \
            currentMinusMethCSum = elements[14]; \
            resetFlag = 1; \
        } \
        else if (resetFlag == 0) { \
            print previousChr, previousStart, previousStop, currentTotalCSum, currentMethCSum, currentPlusTotalCSum, currentPlusMethCSum, currentMinusTotalCSum, currentMinusMethCSum;
            currentTotalCSum = elements[5]; \
            currentPlusTotalCSum = elements[10]; \
            currentMinusTotalCSum = elements[13]; \
            currentMethCSum = elements[6]; \
            currentPlusMethCSum = elements[11]; \
            currentMinusMethCSum = elements[14]; \
            resetFlag = 1; \
        } \
        else if ((currentChr == previousChr) && (currentStart == previousStart) && (currentStop == previousStop)) { \
            currentTotalCSum += elements[5]; \
            currentPlusTotalCSum += elements[10]; \
            currentMinusTotalCSum += elements[13]; \
            currentMethCSum += elements[5]; \
            currentPlusMethCSum += elements[10]; \
            currentMinusMethCSum += elements[13]; \
        } \
        else { \
            resetFlag = 0; \
        } \
        previousChr = currentChr; \
        previousStart = currentStart; \
        previousStop = currentStop; \
    } \
' replicates.bed > answer.bed

The file answer.bed should have a format close to what you're after, but lacks some columns which you can fill in yourself.

For example, it isn't clear how you handle logic for strand, so I'll leave that to you to add the necessary conditionals.

I also leave out "ratio" and some other columns for brevity, which you can add to the print statement.