Question: How to extract the longest overlap?
star190 wrote:

I have two files and find the overlaps between them using bedtools intersect, as below: bedtools intersect -a A.bed -b B.bed -wao > intersect.bed. However, I would like to extract only the longest overlap for each interval in A. Could you please let me know if there is a solution for that?

A.bed

chr   start  end
1     200   250
1     240   300
2     100   120
4     300   360
4     310   400

B.bed

chr   start  end
1     180   220
1     210   260
1     213   348
4     305   352
4     310   370
4     315   382
4     350   400

The output of bedtools intersect:

chr  start  end  chr  start  end  overlaps.bp
1    200    250  1    180    220  20
1    200    250  1    210    260  40
1    200    250  1    213    348  37
1    240    300  1    180    220  0
1    240    300  1    210    260  20
1    240    300  1    213    348  60
4    300    360  4    305    352  47
4    300    360  4    310    370  50
4    300    360  4    315    382  45
4    300    360  4    350    400  10
4    310    400  4    305    352  42
4    310    400  4    310    370  60
4    310    400  4    315    382  67
4    310    400  4    350    400  50
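
For reference, the last column can be reproduced by hand: for two intervals on the same chromosome, the overlap is the smaller end minus the larger start, or 0 if that difference is not positive. A minimal sketch of that arithmetic, where the helper name overlap_bp is made up for this illustration and is not part of bedtools:

```shell
# Hypothetical helper (not a bedtools command): overlap in bp between
# two intervals on the same chromosome.
# Arguments: startA endA startB endB
overlap_bp() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
        lo = (s1 + 0 > s2 + 0) ? s1 + 0 : s2 + 0   # larger of the two starts
        hi = (e1 + 0 < e2 + 0) ? e1 + 0 : e2 + 0   # smaller of the two ends
        d = hi - lo
        print (d > 0 ? d : 0)                      # no overlap gives 0
    }'
}

# Usage: overlap_bp 200 250 210 260
```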

Expected.bed

chr  start  end  chr  start  end  overlaps.bp
1    200    250  1    210    260  40
1    240    300  1    213    348  60
4    300    360  4    310    370  50
4    310    400  4    315    382  67
modified 5 months ago by Biostar ♦♦ 20 • written 5 months ago by star190

With datamash. Output:

$ datamash  -sH -g 1,2,3 -f max 7 < test.txt | cut -f 1-7 
chr start   end chr start   end overlaps.bd
1   200 250 1   210 260 40
1   240 300 1   213 348 60
4   300 360 4   310 370 50
4   310 400 4   315 382 67

Input:

$ cat test.txt 
chr start   end chr start   end overlaps.bd
1   200 250 1   180 220 20
1   200 250 1   210 260 40  
1   200 250 1   213 348 37
1   240 300 1   180 220 0
1   240 300 1   210 260 20
1   240 300 1   213 348 60
4   300 360 4   305 352 47
4   300 360 4   310 370 50
4   300 360 4   315 382 45
4   300 360 4   350 400 10
4   310 400 4   305 352 42
4   310 400 4   310 370 60
4   310 400 4   315 382 67  
4   310 400 4   350 400 50
written 5 months ago by cpad011211k

Did you check bedtools sort, which allows you to sort by region size? (You would have to modify your original intersect command to return only the overlap instead of using -wa and -wb.)

written 5 months ago by Friederike4.8k

Yes, I have checked it, but it was not useful in my case. I want to extract the largest overlap when the same coordinates in file A have several overlaps with coordinates in file B.

written 5 months ago by star190
geek_y9.8k wrote:

test.bed

1   200 250 1   180 220 20
1   200 250 1   210 260 40
1   200 250 1   213 348 37
1   240 300 1   180 220 0
1   240 300 1   210 260 20
1   240 300 1   213 348 60
4   300 360 4   305 352 47
4   300 360 4   310 370 50
4   300 360 4   315 382 45
4   300 360 4   350 400 10
4   310 400 4   305 352 42
4   310 400 4   310 370 60
4   310 400 4   315 382 67
4   310 400 4   350 400 50

sort -k1,1 -k2,2n -k3,3n -k7,7nr test.bed | groupBy -g 1,2,3,4 -c 7 -o first -full | cut -f1-7

1   200 250 1   210 260 40
1   240 300 1   213 348 60
4   300 360 4   310 370 50
4   310 400 4   315 382 67

PS: groupBy is bedtools groupby.
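
The sort-then-take-first idea does not depend on groupBy itself. As a minimal sketch, assuming the same seven tab-separated columns as test.bed and no header line, plain sort and awk can do the same job. The function name longest_overlap is made up for this illustration.

```shell
# Hypothetical helper: for each A interval (columns 1-3), keep the row
# with the largest overlap (column 7). Reads tab-separated rows on stdin.
longest_overlap() {
    # Sort by A interval, then by overlap in descending order, so the
    # row with the largest overlap comes first within each A interval.
    sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n -k7,7nr |
    # Print only the first row seen for each A interval.
    awk -F '\t' '!seen[$1 FS $2 FS $3]++'
}

# Usage: longest_overlap < test.bed
```

If two B intervals tie for the largest overlap, this keeps only one of them per A interval.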

modified 5 months ago • written 5 months ago by geek_y9.8k

I didn't know about groupby. That's an elegant solution (and it doesn't even need the sort if the information from B is not needed):

 bedtools intersect -a test1.bed -b test2.bed  -wao | bedtools groupby -grp 1-3 -c 7 -o max
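
When only the size of the longest overlap per A interval is wanted, the same max-per-group reduction can be sketched in awk. This assumes tab-separated input with the A interval in columns 1-3 and the overlap in column 7; the function name max_overlap_per_interval is hypothetical.

```shell
# Hypothetical helper: print each A interval (columns 1-3) once, followed
# by the largest column-7 value seen for it. Reads rows on stdin.
max_overlap_per_interval() {
    awk -F '\t' -v OFS='\t' '
        {
            key = $1 OFS $2 OFS $3
            if (!(key in max)) {
                order[++n] = key          # remember first-seen order
                max[key] = $7 + 0
            } else if ($7 + 0 > max[key]) {
                max[key] = $7 + 0
            }
        }
        END { for (i = 1; i <= n; i++) print order[i], max[order[i]] }
    '
}

# Usage: bedtools intersect -a test1.bed -b test2.bed -wao | max_overlap_per_interval
```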
modified 5 months ago • written 5 months ago by Friederike4.8k

Thanks for your solution. What does -o first -full do, and how does it differ from -o max -full?

modified 5 months ago • written 5 months ago by star190

-o max -full prints the first full line of each group, regardless of whether that line holds the max or not. Sorting first makes sure the 'first' line is the one with the largest overlap.
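
To illustrate the difference on unsorted input, here is a small awk sketch, not bedtools itself. The first function mimics taking the first row of each group; the second is a one-pass alternative that remembers the whole row holding the maximum, so no sorting is needed. Both function names are hypothetical, and the input is assumed to be tab-separated with the overlap in column 7.

```shell
# Hypothetical helper: first row of each A interval (columns 1-3) in input order.
first_row_per_interval() {
    awk -F '\t' '!seen[$1 FS $2 FS $3]++'
}

# Hypothetical helper: row holding the largest column-7 value for each
# A interval, found in a single pass without sorting.
max_row_per_interval() {
    awk -F '\t' '
        {
            key = $1 FS $2 FS $3
            if (!(key in best)) order[++n] = key      # first-seen order
            if (!(key in best) || $7 + 0 > best[key]) {
                best[key] = $7 + 0                    # new largest overlap
                row[key] = $0                         # keep the whole row
            }
        }
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    '
}

# Usage: max_row_per_interval < test.bed
```

On ties, max_row_per_interval keeps the first row that reached the maximum.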

written 5 months ago by geek_y9.8k

Did you try geek_y's suggestion, and did it solve your problem? Please provide feedback by accepting and/or up-voting helpful or correct answers.

written 5 months ago by h.mon26k