Question: Efficiently performing intersections for multiple subsets of a BED file
0
gravatar for hersh
2.2 years ago by
hersh0
hersh0 wrote:

My goal is to find an efficient way to do multiple intersections very quickly. Here is my problem:

I am looking to do the equivalent operation to intersectBed -a file1.bed -b file2.bed except that I would like to do this for ~4000 different genes in file2.bed, and then store the results in a dictionary.

Currently I am trying do this using pybedtools, but am open to any solutions. In the pybedtools documentation, it says "In general, try to create as few BedTool objects as you can. Every time you create a BedTool object, you create a new open file. There is usually a BEDTools program that already does what you want, and will do it faster." (documentation). So instead of making bedtools inside a for loop, I made the follow type of loop:

file1 = pybedtools.BedTool('file1.bed')
file2 = pybedtools.BedTool('file2.bed')

for gene in genes:

    file2.filter(lambda bedtool : bedtool[3] == gene)

    dict[gene] = file1.intersect(file2)

Where I'm filtering for the coordinates with the gene ID, and then doing the intersection, and storing the results in a dictionary with the gene as a key, and the intersections as a value. This is super slow, and I think that there is probably a much better way to do this same task.

What is the recommended way for performing many intersections in a for loop, or getting an equivalent result without a loop?

Much appreciated!

sequence • 815 views
ADD COMMENTlink modified 2.2 years ago by Alex Reynolds31k • written 2.2 years ago by hersh0
0
gravatar for Alex Reynolds
2.2 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

It should be reasonably fast to do this on sort-bed-sorted inputs A.bed and B.bed:

$ for gene in `cat genes.txt`; do bedops --intersect A.bed <(awk -vgene=${gene} -vOFS="\t" -vFS="\t" '$4==gene' B.bed) | awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ print $0, gene }'  >> intersect.unsorted.bed; done
$ sort-bed intersect.unsorted.bed > intersect.sorted.bed

Gene IDs will be in the fourth column of intersect.sorted.bed.

You can read this file into a Python dictionary object, and then do stuff with that object, whatever that is:

#!/usr/bin/env python

import sys

"""
$ ./parseBed4.py < intersect.sorted.bed
"""

genes = {}
for line in sys.stdin:
  (chr, start, stop, id) = line.rstrip().split('\t')
  if id not in genes:
    genes[id] = []
  genes[id].append({'chr' : chr, 'start' : start, 'stop' : stop})

# do stuff with genes

Because intersect.sorted.bed is sorted, the records in genes[id] will be in sort order, also. This could be useful for binary searches, etc.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Alex Reynolds31k

Hey Alex thanks for your answer. I dowloaded bedops, and tried running the bash commands in your answer, but unfortunately I am getting the following error "awk: can't open file -vgene=cat source line number 1 awk: invalid -v option". I put the genes in a text file with one ID per line. Thanks, once I get this to work I hope that using the sorted option will make it faster.

ADD REPLYlink written 2.2 years ago by hersh0

Are you using the built-in awk on OS X? If so, you might use Homebrew to install GNU awk: brew install awk.

ADD REPLYlink written 2.2 years ago by Alex Reynolds31k

I installed awk using brew and now get "awk: can't open file -vgene=cat source line number 1 awk: syntax error at source line 1 context is >>> genes. <<< txt"

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by hersh0

What happens if you run:

$ for gene in `cat genes.txt`; do echo ${gene}; done

Also, do you have any weird line endings in genes.txt? Perhaps copy and paste the output from running this so we can take a look:

$ cat -te genes.txt | head
ADD REPLYlink written 2.2 years ago by Alex Reynolds31k
XR_001781769.1$
XM_006520483.3$
NM_010601.3$
XM_006520484.2$
NM_016766.3$
NM_001164156.1$
XR_001781768.1$
XR_384436.3$
XR_384437.3$
XR_384438.3$
ADD REPLYlink modified 2.2 years ago by Alex Reynolds31k • written 2.2 years ago by hersh0

It prints the gene IDs without any weird endings

ADD REPLYlink written 2.2 years ago by hersh0

Can you do the same for A.bed and B.bed:

$ cat -te A.bed | head
...
$ cat -te B.bed | head
...

I'm reviewing things and I'm not sure what's wrong.

ADD REPLYlink written 2.2 years ago by Alex Reynolds31k

for A.bed:

chr15^I3057295^I3057329^IGGGCACTGAAACCCCATGATGCTGCTGTCAGCTTG^I46.85$ chr15^I3089214^I3089250^ITTTTCTCTGTGAGACCTGATAGGGATCCAAAGGAGGA^I42.31$ chr15^I3105599^I3105635^IGGACCTTAGCATGACTGTGCTACTCTCTGTTTAGTGC^I42.41$ chr15^I3130755^I3130794^ICATGTTTTGCAGTGTGAGGATGTGTTTAAGGTTACCTGTG^I42.19$ chr15^I3131618^I3131652^ICCCCACCTCAGCCTAATGAACTGGTCCTGAAGACT^I44.72$ chr15^I3132245^I3132279^IAAAATTGTGAGGCCATGTGGATGCACATGCACATG^I43.34$ chr15^I3133991^I3134026^ITGGCTGCTGTACTTCAGTTATGGCATCTCATCTCTG^I42.33$ chr15^I3136454^I3136492^IGAGTTGTGCTATGGTACAAGGGAAGAAAGTGAGCATTGT^I42.61$ chr15^I3138597^I3138632^ITTTTCTATATCGTGTTTGCTATGGAGGCACTGGGCT^I42.41$ chr15^I3139001^I3139035^ITGAAGGTAGTCCTTCCGGCGGTCTTCACGATACAG^I44.58$

for B.bed:

chr15^I99213319^I99213652^IXR_001781769.1_exon_0_0_chr15_99213320_r^I0^I-$ chr15^I99216165^I99216355^IXR_001781769.1_exon_1_0_chr15_99216166_r^I0^I-$ chr15^I99225146^I99225327^IXR_001781769.1_exon_2_0_chr15_99225147_r^I0^I-$ chr15^I99224359^I99225051^IXM_006520483.3_exon_0_0_chr15_99224360_f^I0^I+$ chr15^I99226307^I99226541^IXM_006520483.3_exon_1_0_chr15_99226308_f^I0^I+$ chr15^I99226980^I99227115^IXM_006520483.3_exon_2_0_chr15_99226981_f^I0^I+$ chr15^I99227890^I99228042^IXM_006520483.3_exon_3_0_chr15_99227891_f^I0^I+$ chr15^I99228472^I99228716^IXM_006520483.3_exon_4_0_chr15_99228473_f^I0^I+$ chr15^I99229058^I99229216^IXM_006520483.3_exon_5_0_chr15_99229059_f^I0^I+$ chr15^I99229305^I99229513^IXM_006520483.3_exon_6_0_chr15_99229306_f^I0^I+$

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by hersh0

Okay, your input is a little different than I assumed.

Perhaps try a regular expression, instead:

$ for gene in `cat genes.txt`; do bedops --intersect A.bed <(awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ regex = "^"gene; if ($4 ~ regex) { print $0; } }' B.bed) | awk -vgene=${gene} -vOFS="\t" -vFS="\t" '{ print $0, gene }'  >> intersect.unsorted.bed; done

I tested the awk commands and they worked for me. Hope this helps in some way.

ADD REPLYlink written 2.2 years ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1220 users visited in the last hour