Question: How extract the longest overlap?
0
gravatar for star
20 months ago by
star260
Netherlands
star260 wrote:

I have two files that run intersect and find overlap between them using bedtools intersect as bellow: bedtools intersect -a A.bed -b B.bed -wao > intersect.bed . but i would like to extract the longest overlap from my data. Would you please let me know if there is any solution for that.

A.bed

chr   start  end
1     200   250
1     240   300
2     100   120
4     300   360
4     310   400

B.bed

chr   start   end 
1      180   220 
1      210   260  
1      213   348
4      305  352 
4      310   370
4      315  382
4      350  400

Th output of bedtools intersect:

chr start end  chr start end   overlaps.bd
 1  200    250  1  180  220        20
 1  200    250  1  210  260        40 
 1  200    250  1  213  348        37
 1  240    300  1  180  220         0
 1  240    300  1  210  260        20
 1  240    300  1  213  348        60
 4  300    360  4  305  352        47
 4  300    360  4  310  370        50
 4  300    360  4  315  382        45
 4  300    360  4  350  400        10
4  310     400  4  305  352        42
4  310     400  4  310  370        60
4  310     400  4  315  382        67 
4  310     400  4  350  400        50

Expected.bed

chr start  end    chr start end   overlaps.bp 
 1  200    250     1    210  260      40
 1  240   300      1    213  348     60
 4  300    360     4    310  370     50 
 4  310   400      4    315  382     67
ADD COMMENTlink modified 19 months ago by Biostar ♦♦ 20 • written 20 months ago by star260
2

with datamash, output:

$ datamash  -sH -g 1,2,3 -f max 7 < test.txt | cut -f 1-7 
chr start   end chr start   end overlaps.bd
1   200 250 1   210 260 40
1   240 300 1   213 348 60
4   300 360 4   310 370 50
4   310 400 4   315 382 67

input:

$ cat test.txt 
chr start   end chr start   end overlaps.bd
1   200 250 1   180 220 20
1   200 250 1   210 260 40  
1   200 250 1   213 348 37
1   240 300 1   180 220 0
1   240 300 1   210 260 20
1   240 300 1   213 348 60
4   300 360 4   305 352 47
4   300 360 4   310 370 50
4   300 360 4   315 382 45
4   300 360 4   350 400 10
4   310 400 4   305 352 42
4   310 400 4   310 370 60
4   310 400 4   315 382 67  
4   310 400 4   350 400 50
ADD REPLYlink written 19 months ago by cpad011214k

did you check bedtools sort that allows you to sort by region size (you would have to modify your original intersect command to only return the overlap instead of -wa and -wb)

ADD REPLYlink written 20 months ago by Friederike6.4k

yes I have checked it but it was not useful in my case. I want to extract the largest overlap while the same coordinate in file A have several overlaps with coordinates of B file.

ADD REPLYlink written 20 months ago by star260
4
gravatar for geek_y
20 months ago by
geek_y11k
Barcelona
geek_y11k wrote:

test.bed

1   200 250 1   180 220 20
1   200 250 1   210 260 40
1   200 250 1   213 348 37
1   240 300 1   180 220 0
1   240 300 1   210 260 20
1   240 300 1   213 348 60
4   300 360 4   305 352 47
4   300 360 4   310 370 50
4   300 360 4   315 382 45
4   300 360 4   350 400 10
4   310 400 4   305 352 42
4   310 400 4   310 370 60
4   310 400 4   315 382 67
4   310 400 4   350 400 50

cat test.bed | sort -k1,1 -k2,2n -k7,7nr | groupBy -g 1,2,3,4 -c 7 -o first -full | cut -f1-7

1   200 250 1   210 260 40
1   240 300 1   213 348 60
4   300 360 4   310 370 50
4   310 400 4   315 382 67

Ps: groupBy is bedtools groupBy

ADD COMMENTlink modified 20 months ago • written 20 months ago by geek_y11k
1

didn't know about groupby, that's an elegant solution (that doesn't even need the sort if the information from b are not needed)

 bedtools intersect -a test1.bed -b test2.bed  -wao | bedtools groupby -grp 1-3 -c 7 -o max
ADD REPLYlink modified 20 months ago • written 20 months ago by Friederike6.4k

Thanks for your solution. what is the -o first -full and what is it difference with -o max -full?

ADD REPLYlink modified 20 months ago • written 20 months ago by star260
1

-o max -full provides the first full line irrespective if its max or not. Sorting makes sure the 'first' is the actual full overlap.

ADD REPLYlink written 20 months ago by geek_y11k

Did you try geek_y suggestion and did it solve your problem? Please provide feedback by accepting and / or up-voting helpful / correct answers.

ADD REPLYlink written 20 months ago by h.mon31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1157 users visited in the last hour