Efficiently performing intersections for multiple subsets of a BED file
1
0
Entering edit mode
5.6 years ago
hersh ▴ 60

My goal is to find an efficient way to do multiple intersections very quickly. Here is my problem:

I am looking to do the equivalent operation to intersectBed -a file1.bed -b file2.bed except that I would like to do this for ~4000 different genes in file2.bed, and then store the results in a dictionary.

Currently I am trying do this using pybedtools, but am open to any solutions. In the pybedtools documentation, it says "In general, try to create as few BedTool objects as you can. Every time you create a BedTool object, you create a new open file. There is usually a BEDTools program that already does what you want, and will do it faster." (documentation). So instead of making bedtools inside a for loop, I made the follow type of loop:

file1 = pybedtools.BedTool('file1.bed')
file2 = pybedtools.BedTool('file2.bed')

for gene in genes:

    file2.filter(lambda bedtool : bedtool[3] == gene)

    dict[gene] = file1.intersect(file2)

Where I'm filtering for the coordinates with the gene ID, and then doing the intersection, and storing the results in a dictionary with the gene as a key, and the intersections as a value. This is super slow, and I think that there is probably a much better way to do this same task.

What is the recommended way for performing many intersections in a for loop, or getting an equivalent result without a loop?

Much appreciated!

sequence • 2.4k views
ADD COMMENT
0
Entering edit mode
5.6 years ago

It should be reasonably fast to do this on sort-bed-sorted inputs A.bed and B.bed:

$ for gene in `cat genes.txt`; do bedops --intersect A.bed <(awk -vgene=${gene} -vOFS="\t" -vFS="\t" '$4==gene' B.bed) | awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ print $0, gene }'  >> intersect.unsorted.bed; done
$ sort-bed intersect.unsorted.bed > intersect.sorted.bed

Gene IDs will be in the fourth column of intersect.sorted.bed.

You can read this file into a Python dictionary object, and then do stuff with that object, whatever that is:

#!/usr/bin/env python

import sys

"""
$ ./parseBed4.py < intersect.sorted.bed
"""

genes = {}
for line in sys.stdin:
  (chr, start, stop, id) = line.rstrip().split('\t')
  if id not in genes:
    genes[id] = []
  genes[id].append({'chr' : chr, 'start' : start, 'stop' : stop})

# do stuff with genes

Because intersect.sorted.bed is sorted, the records in genes[id] will be in sort order, also. This could be useful for binary searches, etc.

ADD COMMENT
0
Entering edit mode

Hey Alex thanks for your answer. I dowloaded bedops, and tried running the bash commands in your answer, but unfortunately I am getting the following error "awk: can't open file -vgene=cat source line number 1 awk: invalid -v option". I put the genes in a text file with one ID per line. Thanks, once I get this to work I hope that using the sorted option will make it faster.

ADD REPLY
0
Entering edit mode

Are you using the built-in awk on OS X? If so, you might use Homebrew to install GNU awk: brew install awk.

ADD REPLY
0
Entering edit mode

I installed awk using brew and now get "awk: can't open file -vgene=cat source line number 1 awk: syntax error at source line 1 context is >>> genes. <<< txt"

ADD REPLY
0
Entering edit mode

What happens if you run:

$ for gene in `cat genes.txt`; do echo ${gene}; done

Also, do you have any weird line endings in genes.txt? Perhaps copy and paste the output from running this so we can take a look:

$ cat -te genes.txt | head
ADD REPLY
0
Entering edit mode
XR_001781769.1$
XM_006520483.3$
NM_010601.3$
XM_006520484.2$
NM_016766.3$
NM_001164156.1$
XR_001781768.1$
XR_384436.3$
XR_384437.3$
XR_384438.3$
ADD REPLY
0
Entering edit mode

It prints the gene IDs without any weird endings

ADD REPLY
0
Entering edit mode

Can you do the same for A.bed and B.bed:

$ cat -te A.bed | head
...
$ cat -te B.bed | head
...

I'm reviewing things and I'm not sure what's wrong.

ADD REPLY
0
Entering edit mode

for A.bed:

chr15^I3057295^I3057329^IGGGCACTGAAACCCCATGATGCTGCTGTCAGCTTG^I46.85$ chr15^I3089214^I3089250^ITTTTCTCTGTGAGACCTGATAGGGATCCAAAGGAGGA^I42.31$ chr15^I3105599^I3105635^IGGACCTTAGCATGACTGTGCTACTCTCTGTTTAGTGC^I42.41$ chr15^I3130755^I3130794^ICATGTTTTGCAGTGTGAGGATGTGTTTAAGGTTACCTGTG^I42.19$ chr15^I3131618^I3131652^ICCCCACCTCAGCCTAATGAACTGGTCCTGAAGACT^I44.72$ chr15^I3132245^I3132279^IAAAATTGTGAGGCCATGTGGATGCACATGCACATG^I43.34$ chr15^I3133991^I3134026^ITGGCTGCTGTACTTCAGTTATGGCATCTCATCTCTG^I42.33$ chr15^I3136454^I3136492^IGAGTTGTGCTATGGTACAAGGGAAGAAAGTGAGCATTGT^I42.61$ chr15^I3138597^I3138632^ITTTTCTATATCGTGTTTGCTATGGAGGCACTGGGCT^I42.41$ chr15^I3139001^I3139035^ITGAAGGTAGTCCTTCCGGCGGTCTTCACGATACAG^I44.58$

for B.bed:

chr15^I99213319^I99213652^IXR_001781769.1_exon_0_0_chr15_99213320_r^I0^I-$ chr15^I99216165^I99216355^IXR_001781769.1_exon_1_0_chr15_99216166_r^I0^I-$ chr15^I99225146^I99225327^IXR_001781769.1_exon_2_0_chr15_99225147_r^I0^I-$ chr15^I99224359^I99225051^IXM_006520483.3_exon_0_0_chr15_99224360_f^I0^I+$ chr15^I99226307^I99226541^IXM_006520483.3_exon_1_0_chr15_99226308_f^I0^I+$ chr15^I99226980^I99227115^IXM_006520483.3_exon_2_0_chr15_99226981_f^I0^I+$ chr15^I99227890^I99228042^IXM_006520483.3_exon_3_0_chr15_99227891_f^I0^I+$ chr15^I99228472^I99228716^IXM_006520483.3_exon_4_0_chr15_99228473_f^I0^I+$ chr15^I99229058^I99229216^IXM_006520483.3_exon_5_0_chr15_99229059_f^I0^I+$ chr15^I99229305^I99229513^IXM_006520483.3_exon_6_0_chr15_99229306_f^I0^I+$

ADD REPLY
0
Entering edit mode

Okay, your input is a little different than I assumed.

Perhaps try a regular expression, instead:

$ for gene in `cat genes.txt`; do bedops --intersect A.bed <(awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ regex = "^"gene; if ($4 ~ regex) { print $0; } }' B.bed) | awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ print $0, gene }'  >> intersect.unsorted.bed; done

I tested the awk commands and they worked for me. Hope this helps in some way.

ADD REPLY

Login before adding your answer.

Traffic: 2131 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6