My goal is to find an efficient way to do multiple intersections very quickly. Here is my problem:
I am looking to do the equivalent operation to
intersectBed -a file1.bed -b file2.bed except that I would like to do this for ~4000 different genes in file2.bed, and then store the results in a dictionary.
Currently I am trying do this using pybedtools, but am open to any solutions. In the pybedtools documentation, it says "In general, try to create as few BedTool objects as you can. Every time you create a BedTool object, you create a new open file. There is usually a BEDTools program that already does what you want, and will do it faster." (documentation). So instead of making bedtools inside a for loop, I made the follow type of loop:
file1 = pybedtools.BedTool('file1.bed') file2 = pybedtools.BedTool('file2.bed') for gene in genes: file2.filter(lambda bedtool : bedtool == gene) dict[gene] = file1.intersect(file2)
Where I'm filtering for the coordinates with the gene ID, and then doing the intersection, and storing the results in a dictionary with the gene as a key, and the intersections as a value. This is super slow, and I think that there is probably a much better way to do this same task.
What is the recommended way for performing many intersections in a for loop, or getting an equivalent result without a loop?