Question

Fast Interval Intersection Methodologies

17

Entering edit mode

15.4 years ago

Biostar User ★ 1.0k

Most genomic annotations are specified as intervals along the genome.

Interval trees have been known to provide an efficient datastructure that allows for very fast overlap querying.
Nested Containment Lists have been proposed as an even faster alternative

Provide code examples in your programming language that demonstrate the use of fast interval querying.

genomics • 21k views

ADD COMMENT • link updated 20 months ago by Ram 45k • written 15.4 years ago by Biostar User ★ 1.0k

0

Entering edit mode

anyone benchmarked a nested containment list against an interval tree? other than pygr are there any NCL's in the wild?

ADD REPLY • link 15.2 years ago by brentp 24k

Ram · Answer 1 · 2010-03-01

This code example generates 10,000 intervals then queries them for overlapping regions. Requires only the presence of Python.

The code below requires the either the installation of the bx python package or alternatively you may just download the quicksect.py module and place it next to the script itself:

from random import randint, seed

# if you can install bx python then uncomment the line below
#
# from bx.intervals.operations.quicksect import IntervalNode

# otherwise just download the quickset module as shown above 
# and place it in next to your program
#
from quicksect import IntervalNode

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
    "Generates random interval over a size and span"
    lo = randint(10000, SIZE)
    hi = lo + randint(1, SPAN)
    return (lo, hi)

def find(start, end, tree):
    "Returns a list with the overlapping intervals"
    out = []
    tree.intersect( start, end, lambda x: out.append(x) )
    return [ (x.start, x.end) for x in out ]

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# start the root at the first element
start, end = data[0]
tree = IntervalNode( start, end )

# build an interval tree from the rest of the data
for start, end in data[1:]:
    tree = tree.insert( start, end )

for start, end in query:
    overlap = find(start, end , tree)
    print '(%s, %s) -> %s' % (start, end, overlap)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36981, 36984), (36973, 36982), (36978, 36987)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12806, 12811), (12799, 12806), (12809, 12819)]
(30339, 30343) -> [(30336, 30346), (30335, 30345), (30340, 30341)]

Ram · Answer 2 · 2010-03-01

This code example generates 10,000 intervals then queries them for overlapping regions. The installation requires a C compiler and Python.

This is a faster solution than the other example, one that now requires the full installation of the bx-python module. The data structure is implemented in C and is at least one order of magnitude faster than the quicksect.py module presented in another example.

from random import randint, seed
from bx.intervals.intersection import Intersecter, Interval

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
    "Generates random interval over a size and span"
    lo = randint(10000, SIZE)
    hi = lo + randint(1, SPAN)
    return (lo, hi)

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# create the interval tree
tree = Intersecter()

# build an interval tree from the rest of the data
for start, end in data:
    tree.add_interval( Interval(start, end) )

# perform the query
for start, end in query:
    overlap = tree.find(start, end)
    out = [ (x.start, x.end) for x in overlap ]
    print '(%s, %s) -> %s' % (start, end, out)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36973, 36982), (36978, 36987), (36981, 36984)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12799, 12806), (12806, 12811), (12809, 12819)]
(30339, 30343) -> [(30335, 30345), (30336, 30346), (30340, 30341)]

Ram · Answer 3 · 2010-03-02

An alternative is to use BED files to store data and BEDTools to calculate the intersection.

The BED format is a format used by UCSC and many other projects to generically store annotations on genomes. Sometimes, it is easier to use just plain BED files to store annotations, instead of complex databases or HDF5, and BEDTools make it also faster to access them. Moreover, with BEDFiles you have other advantages, as you can use custom tracks on ucsc or gbrowse and there are other tools that use BED as input.

Anyway, to solve the problem of intersection with BEDTools, you would do:

intersectBed -a segdups.bed -b exons.bed

score 0 · Answer 4 · 2017-10-30

0

Entering edit mode

7.7 years ago

daniel ▴ 30

I've recently developed a C++ library, Mappable, which efficiently handles intervals. It allows you to write stuff like this:

std::vector<MyMappableType> mappables {};
/* insert data */
const auto overlapped = overlap_range(mappables, some_interval); // returns an iterator range
for (const auto& mappable : overlapped) {
    std::cout << mappable << std::endl;
}
std::cout << count_contained(mappables, some_interval) << std::endl;

One really nice thing about the library is that the algorithms are independent of the container used to store the data (like the C++ STL).

I've benchmarked against interval tree based methods and the performance advantage can be quite striking - both in memory usage and query time (see benchmark code on Github repo).

ADD COMMENT • link 7.7 years ago by daniel ▴ 30

1

Entering edit mode

I would recommend providing a Python interface to it (as an extension library) as that would significantly raise its utility to many more people.

ADD REPLY • link 7.7 years ago by Istvan Albert 102k

1

Entering edit mode

I would suggest usage of pybind11 to write such a python extension; this amazing library is really a good choice to expose fast C++ algorithms implementations in python in order to make them more popular as Istvan said.

ADD REPLY • link 7.7 years ago by edrezen ▴ 730