How to merge replicates in bed file, but keep counts of additional columns?
2
0
Entering edit mode
9.9 years ago
YanO ▴ 140

I am trying to merge replicates from a bisulfite sequencing experiment. I am working with the output files of the MOABS programme, which gives positions for CpGs and if they were methylated or not. It can be thought of as a bed file with extra columns - some columns have counts I want to sum and keep track of, some I want to be able to specify as "+", "-", or "B" . Would anyone know of a programme that can do this? Or a script somewhere?

I actually wrote a programme in R that works, but with my relatively basic coding it's just too slow for the huge files I have, so I'm looking for a new direction. I've tried bedtools merge too but it doesn't give me the flexibility I need. Same for bedops possibly?

The headers for my data are as follows:

chrom    start    end    ratio    totalC    methC    strand    next    Plus    totalC    methC    Minus    totalC    methC

This is a sample of my data:

chr1    3000573    3000575    1    2    2    +    G    +    2    2    -    0    0
chr1    3000573    3000575    1    5    5    B    G    +    3    3    -    2    2
chr1    3000573    3000575    1    5    5    B    G    +    4    4    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1

And I'm looking to end up with this:

chr1    3000573    3000575    1    12    12    B    G    +    9    9    -    3   3
chr1    3000725    3000727    1    2     2     -    G    +    0    0    -    2   2

Any suggestions much appreciated. Thanks in advance!

merge sequencing bed • 4.6k views
ADD COMMENT
0
Entering edit mode

I would use awk to solve such problem (e.g., group). Can you please explain what you want to do with column 7 (+/-/B) as this is not clear to me.

ADD REPLY
0
Entering edit mode

Column 7 is the awkward one. If both columns 10 and 13 are > 0 then column 7 gets a "B" (meaning counts come from both + and - strands). If only column 10 is > 0 then it's a "+", and only column 13 is > 0 then it's "-".

Could you please say a bit more on which awk commands might work? I'm not too familiar with it.

ADD REPLY
0
Entering edit mode

I thought I'd written a little python program to do this for MOABS files, but can't seem to find it at the moment. Anyway, this is pretty simple to do in any scripting languages (you could even use awk, but it'd get a bit unwieldy).

ADD REPLY
2
Entering edit mode
9.9 years ago

Using bedtools groupby it should be quite simple, although I don't completely understand the final desired output...

Input example:

cat test.bed
chr1    3000573    3000575    1    2    2    +    G    +    2    2    -    0    0
chr1    3000573    3000575    1    5    5    B    G    +    3    3    -    2    2
chr1    3000573    3000575    1    5    5    B    G    +    4    4    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1
chr1    3000725    3000727    1    1    1    -    G    +    0    0    -    1    1

Assuming input is sorted by chromosome and position and it has no header:

groupBy -g 1,2,3 \
    -c 5,6,7,8,9,10,11,12,13,14 \
    -o sum,sum,distinct,distinct,distinct,sum,sum,distinct,sum,sum \
    -i test.bed \
| awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $4/$5, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}'

You get:

chr1    3000573    3000575    1    12    12    +,B    G    +    9    9    -    3    3
chr1    3000725    3000727    1    2    2    -    G    +    0    0    -    2    2

The awk command is just to restore the "ratio" column to be col 5 / col 6.

As it is not clear to me what to do with with columns 7, 8, 9 and 12 I put the option "distinct", which gives a comma separated list of the unique elements found.

ADD COMMENT
0
Entering edit mode

This worked great, thank you! Just one thing in case others are interested too, - the "ratio" column should be $5/$4, not the other way around as above. I was able to sort out column 7 in R then afterwards.

ADD REPLY
0
Entering edit mode
9.9 years ago

Strip the header line and use BEDOPS sort-bed to sort your BED file:

$ tail -n +2 replicates.unsorted.bed | sort-bed - > replicates.bed

Sorting allows you to compare one line to the next and decide how to handle state variables. If your data were not sorted, then you would not be able to process the BED file as a stream.

So, as mentioned, use awk to process your sorted BED file in serial:

$ awk ' \
    BEGIN { \
        FS = "\t"; \
        previousChr = "chrFoo"; \
        previousStart = 0; \
        previousStop = 0; \
        resetFlag = 0; \
    } \
    { \
        split($0, elements, "\t"); \
        currentChr = elements[1]; \
        currentStart = elements[2]; \
        currentStop = elements[3]; \
        if (NR = 1) { \
            currentTotalCSum = elements[5]; \
            currentMethCSum = elements[6]; \
            currentPlusTotalCSum = elements[10]; \
            currentPlusMethCSum = elements[11]; \
            currentMinusTotalCSum = elements[13]; \
            currentMinusMethCSum = elements[14]; \
            resetFlag = 1; \
        } \
        else if (resetFlag == 0) { \
            print previousChr, previousStart, previousStop, currentTotalCSum, currentMethCSum, currentPlusTotalCSum, currentPlusMethCSum, currentMinusTotalCSum, currentMinusMethCSum;
            currentTotalCSum = elements[5]; \
            currentPlusTotalCSum = elements[10]; \
            currentMinusTotalCSum = elements[13]; \
            currentMethCSum = elements[6]; \
            currentPlusMethCSum = elements[11]; \
            currentMinusMethCSum = elements[14]; \
            resetFlag = 1; \
        } \
        else if ((currentChr == previousChr) && (currentStart == previousStart) && (currentStop == previousStop)) { \
            currentTotalCSum += elements[5]; \
            currentPlusTotalCSum += elements[10]; \
            currentMinusTotalCSum += elements[13]; \
            currentMethCSum += elements[5]; \
            currentPlusMethCSum += elements[10]; \
            currentMinusMethCSum += elements[13]; \
        } \
        else { \
            resetFlag = 0; \
        } \
        previousChr = currentChr; \
        previousStart = currentStart; \
        previousStop = currentStop; \
    } \
' replicates.bed > answer.bed

The file answer.bed should have a format close to what you're after, but lacks some columns which you can fill in yourself.

For example, it isn't clear how you handle logic for strand, so I'll leave that to you to add the necessary conditionals.

I also leave out "ratio" and some other columns for brevity, which you can add to the print statement.

ADD COMMENT
0
Entering edit mode

Thank you for such a detailed answer! I will definitely give this a go.

ADD REPLY

Login before adding your answer.

Traffic: 2301 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6