Question

bedtools intersect: keep pairing consistent

0

Entering edit mode

8.5 years ago

GouthamAtla 12k

I have two bed files (let's say B1.bed, B2.bed) to be queried against another file (A.bed).

A.bed

B1.bed

1 1 10 reg1
1 1 10 reg2
1 30 40 reg3
1 30 40 reg4

B2.bed

1 30 40 reg1
1 60 70 reg2
1 80 100 reg3
1 110 130 reg4

I would like to know what would be the way to get an output regions such that the reg1 of B1 and reg1 of B2 both are intersecting a feature in A, and so on for ref2, reg3, reg4.

something like:

if reg of B1 and reg of B2 intersects with A: print

I could use the pybedtools:

https://gist.github.com/gouthamatla/c5a7b9f33b8cb78ba305

output:

True
False # Because reg2 of B2.bed does not intersect with A.bed
True
True

Here I do not see an option that A.intersect could read data as string, hence I have to convert every interval to a bed file before using it, which might be taking very very long time for the script to run on large data sets.

Is there any other way to do it or to optimize my snippet ? For me its more or less sounds like checking if both the fragments are mapped to same regions, using bedtools

P.S: I don't know how to title this question.

bedtools • 2.9k views

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 8.5 years ago by GouthamAtla 12k

Ram · Answer 1 · 2016-02-05

1

Entering edit mode

8.5 years ago

Chris Miller 22k

Okay, so intersect A with B1 first, then with B2.

bedtools intersect -a <(bedtools intersect -a A.bed -b B1.bed -wa) -b B2.bed -c | cut -f 4 >output

That'll give you a 0 or 1 for each item in A, stating whether it intersects both B1 and B2.

If you want the actual lines of A that intersect, replace -c with -wa

ADD COMMENT • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by Chris Miller 22k

0

Entering edit mode

No. the regions are given same identifiers in B1 and B2. Check if the reg1 of B1 and reg1 of B2 overlaps with A. It is not necessary that the reg1 of B1 overlaps with reg1 of B2.

ADD REPLY • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by GouthamAtla 12k

0

Entering edit mode

Okay, let's try again - I edited my answer above.

ADD REPLY • link 8.5 years ago by Chris Miller 22k

Ram · Answer 2 · 2016-02-05

0

Entering edit mode

8.5 years ago

GouthamAtla 12k

Just to update, from the author of pybedtools, Ryan Dale:

You're right, there's a lot of overhead with this strategy. This is a good candidate for using the IntervalFile object, which accesses the underlying C++ directly. You'll probably want to use the any_hits() method.

So your example can be modified to the following:

import pybedtools
bed1=pybedtools.BedTool("B1.bed")
bed2=pybedtools.BedTool("B2.bed")
A=pybedtools.BedTool("A.bed")
a = A.as_intervalfile()

for count, (b1, b2) in enumerate(zip(bed1, bed2)):
    if a.any_hits(b1) and a.any_hits(b2):
        print("True")
    else:
        print("False")

Using some BED files I had laying around each with 3000+ lines, the original version took about 2 mins to run; this new version takes just over one second.

ADD COMMENT • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by GouthamAtla 12k

0

Entering edit mode

FWIW, you can also pre-sort your bed files and add the -sorted option to the command line bedtools, which should speed things up considerably.

ADD REPLY • link 8.5 years ago by Chris Miller 22k

0

Entering edit mode

Thanks Chris. I am sorry if I am not clear in my question. I wanted to process both the B1 and B2 together ( for example as a paired samples. Like in the script zip(bed1, bed2)). Which is to check, if line1 of B1 and line1 of B2 overlaps any feature in A. stop. go to next line. check if line2 of B1 and line2 of B2 overlaps with any feature of A. likewise. I do not want to lose the pairing nature of lines. With bedtools, I process them independently. I am really sorry if it still confusing.

ADD REPLY • link updated 4.5 years ago by Ram 44k • written 8.5 years ago by GouthamAtla 12k

0

Entering edit mode

I see, so you're not looking for intersections within the whole file, just trying to determine whether there's a match on line N of A and line N of B1 and B2. Well then, yes, your python solution is a good one

ADD REPLY • link 8.5 years ago by Chris Miller 22k