Question: Bedtools, report feature with highest overlap if overlaps two features
0
gravatar for murphy.charlesj
2.9 years ago by
United States
murphy.charlesj90 wrote:

I have two bed files, one containing non-overlapping windows of 10,000bp of the genome, and the other containing the capture regions for whole-exon sequencing. I want to know (for each 10,000bp window) which capture regions it overlaps. And in the case where a capture region overlaps two 10,000bp windows, just report the one with highest overlap.

I've been playing around with bedtools to do this via commands like,

bedtools closest -a a.bed -b b.bed

But this reports the capture region multiple times in the event it overlaps two 10,000bp windows. I've also tried using the -first and -last option, but they do not report the capture region with highest overlap. Here are the two example bed files and output:

a.bed

1   1   10
1   11  20
1   21  30
1   31  40

b.bed

1   4   9
1   11  14
1   15  24
1   21  31

Output (from the command above):

1   1   10  1   4   9
1   11  20  1   11  14
1   11  20  1   15  24
1   21  30  1   15  24
1   21  30  1   29  39
1   31  40  1   29  39
genome • 1.4k views
ADD COMMENTlink modified 2.9 years ago by Alex Reynolds29k • written 2.9 years ago by murphy.charlesj90

I think you should check intersectBed option instead of closestBed. Also check groupBy from BedTools to find the regions with highest overlap.

ADD REPLYlink written 2.9 years ago by venu6.3k

Thanks for the tip. I tried the following command and it seems to work:

bedtools intersect -a a.bed -b b.bed -wao | bedtools groupby -g 4,5,6 -c 7 -o max -full

The output:

1   1   10  1   4   9   5   5
1   11  20  1   11  14  3   3
1   11  20  1   15  24  5   5
1   21  30  1   29  39  1   8

EDIT: oops, actually this does not seem to work, let me tinker a bit more

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by murphy.charlesj90
3
gravatar for Alex Reynolds
2.9 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You could do this with a couple bedmap operations:

$ bedmap --count --echo --echo-map 1.bed 2.bed \
    | awk -F'|' '($1==1)' \
    | cut -d'|' -f2- \
    | tr '|' '\t' \
    > single_overlaps.bed

$ bedmap --count --echo --echo-map --echo-overlap-size 1.bed 2.bed \
    | awk -F'|' '($1==2)' \
    | cut -d'|' -f2- \
    | awk -F'|' '{ 
        s = split($3, sizes, ";"); \
        o = split($2, overlaps, ";"); \
        x = 0; \
        y = 0; \ 
        for (i = 1; i <= s; i++) { \
            if (x <= sizes[i]) { \
                x = sizes[i]; \
                y = i; \
            } \
        } \
        print $1"\t"overlaps[y]; \
    }' > max_of_paired_overlaps.bed

$ bedops --everything single_overlaps.bed max_of_paired_overlaps.bed > answer.bed

Ties go to downstream elements.

ADD COMMENTlink modified 2.8 years ago • written 2.9 years ago by Alex Reynolds29k
1

Thanks, this solution does work.

ADD REPLYlink written 2.9 years ago by murphy.charlesj90
1
gravatar for murphy.charlesj
2.9 years ago by
United States
murphy.charlesj90 wrote:

Never mind, I'll just do this using the IRanges R package via the overlap() function. I was initially hesitant to use R for this out of fear that it would be slow, but the overlap() function is pretty fast.

ADD COMMENTlink written 2.9 years ago by murphy.charlesj90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2074 users visited in the last hour