bedtools intersect: keep pairing consistent
2
0
Entering edit mode
5.6 years ago

I have two bed files ( lets say B1.bed, B2.bed ) to be queried against another file ( A.bed ).

A.bed

1 1 10
1 30 40
1 80 100
1 110 130

B1.bed

1 1 10 reg1
1 1 10 reg2
1 30 40 reg3
1 30 40 reg4

B2.bed

1 30 40 reg1
1 60 70 reg2
1 80 100 reg3
1 110 130 reg4

I would like to know what would be the way to get an output regions such that the  reg1 of B1 and reg1 of B2 both are intersecting a feature in A, and so on for ref2, reg3, reg4.

something like:

if reg of B1 and reg of B2 intersects with A: print


I could use the pybedtools:

output:

True
False # Because reg2 of B2.bed does not intersect with A.bed
True
True

Here I do not see an option that A.intersect could read data as string, hence I have to convert every interval to a bed file before using it, which might be taking very very long time for the script to run on large data sets.

Is there any other way to do it or to optimize my snippet ? For me its more or less sounds like checking if both the fragments are mapped to same regions, using bedtools

P.S: I dont know how to title this question.

bedtools • 1.7k views
1
Entering edit mode
5.6 years ago

Okay, so intersect A with B1 first, then with B2.

bedtools intersect -a <(bedtools intersect -a A.bed -b B1.bed -wa) -b B2.bed -c | cut -f 4 >output


That'll give you a 0 or 1 for each item in A, stating whether it intersects both B1 and B2.

If you want the actual lines of A that intersect, replace -c with -wa

0
Entering edit mode

No. the regions are given same identifiers in B1 and B2. Check if the reg1 of B1 and reg1 of B2 overlaps with A. It is not necessary that the reg1 of B1 overlaps with reg1 of B2.

0
Entering edit mode

Okay, let's try again - I edited my answer above.

0
Entering edit mode
5.6 years ago

Just to update, from the author of pybedtools, Ryan Dale:

You're right, there's a lot of overhead with this strategy. This is a good candidate for using the IntervalFile object, which accesses the underlying C++ directly. You'll probably want to use the any_hits() method.

So your example can be modified to the following:

import pybedtools
bed1=pybedtools.BedTool("B1.bed")
bed2=pybedtools.BedTool("B2.bed")
A=pybedtools.BedTool("A.bed")
a = A.as_intervalfile()

for count, (b1, b2) in enumerate(zip(bed1, bed2)):
if a.any_hits(b1) and a.any_hits(b2):
print("True")
else:
print("False")


Using some BED files I had laying around each with 3000+ lines, the original version took about 2 mins to run; this new version takes just over one second.

0
Entering edit mode

FWIW, you can also pre-sort your bed files and add the -sorted option to the command line bedtools, which should speed things up considerably.

0
Entering edit mode

Thanks Chris. I am sorry if I am not clear in my question. I wanted to process both the B1 and B2 together ( for example as a paired samples. Like in the script zip(bed1, bed2)). Which is to check, if line1 of B1 and line1 of B2 overlaps any feature in A. stop. go to next line. check if line2 of B1 and line2 of B2 overlaps with any feature of A. likewise. I do not want to lose the pairing nature of lines. With bedtools, I process them independently. I am really sorry if it still confusing.

0
Entering edit mode

I see, so you're not looking for intersections within the whole file, just trying to determine whether there's a match on line N of A and line N of B1 and B2. Well then, yes, your python solution is a good one