Question: Fast Interval Intersection Methodologies
17
10.7 years ago by
Biostar User1.0k
Biostar User1.0k wrote:

Most genomic annotations are specified as intervals along the genome.

Provide code examples in your programming language that demonstrate the use of fast interval querying.

genomics • 17k views
modified 3.1 years ago by daniel30 • written 10.7 years ago by Biostar User1.0k

anyone benchmarked a nested containment list against an interval tree? other than pygr are there any NCL's in the wild?

16
10.7 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

This code example generates 10,000 intervals then queries them for overlapping regions. Requires only the presence of Python.

The code below requires the either the installation of the bx python package or alternatively you may just download the quicksect.py module and place it next to the script itself:

from random import randint, seed

# if you can install bx python then uncomment the line below
#
# from bx.intervals.operations.quicksect import IntervalNode

# and place it in next to your program
#
from quicksect import IntervalNode

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
"Generates random interval over a size and span"
lo = randint(10000, SIZE)
hi = lo + randint(1, SPAN)
return (lo, hi)

def find(start, end, tree):
"Returns a list with the overlapping intervals"
out = []
tree.intersect( start, end, lambda x: out.append(x) )
return [ (x.start, x.end) for x in out ]

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# start the root at the first element
start, end = data[0]
tree = IntervalNode( start, end )

# build an interval tree from the rest of the data
for start, end in data[1:]:
tree = tree.insert( start, end )

for start, end in query:
overlap = find(start, end , tree)
print '(%s, %s) -> %s' % (start, end, overlap)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36981, 36984), (36973, 36982), (36978, 36987)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12806, 12811), (12799, 12806), (12809, 12819)]
(30339, 30343) -> [(30336, 30346), (30335, 30345), (30340, 30341)]
10
10.7 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

This code example generates 10,000 intervals then queries them for overlapping regions. The installation requires a C compiler and Python.

This is a faster solution than the other example, one that now requires the full installation of the bx-python module. The data structure is implemented in C and is at least one order of magnitude faster than the quicksect.py module presented in another example.

from random import randint, seed
from bx.intervals.intersection import Intersecter, Interval

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
"Generates random interval over a size and span"
lo = randint(10000, SIZE)
hi = lo + randint(1, SPAN)
return (lo, hi)

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# create the interval tree
tree = Intersecter()

# build an interval tree from the rest of the data
for start, end in data:

# perform the query
for start, end in query:
overlap = tree.find(start, end)
out = [ (x.start, x.end) for x in overlap ]
print '(%s, %s) -> %s' % (start, end, out)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36973, 36982), (36978, 36987), (36981, 36984)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12799, 12806), (12806, 12811), (12809, 12819)]
(30339, 30343) -> [(30335, 30345), (30336, 30346), (30340, 30341)]

This is awesome! Thanks for sharing that, Istvan.

7
10.7 years ago by
London, UK
Giovanni M Dall'Olio27k wrote:

An alternative is to use BED files to store data and BEDTools to calculate the intersection.

The BED format is a format used by UCSC and many other projects to generically store annotations on genomes. Sometimes, it is easier to use just plain BED files to store annotations, instead of complex databases or HDF5, and BEDTools make it also faster to access them. Moreover, with BEDFiles you have other advantages, as you can use custom tracks on ucsc or gbrowse and there are other tools that use BED as input.

Anyway, to solve the problem of intersection with BEDTools, you would do:

intersectBed -a segdups.bed -b exons.bed
0
3.1 years ago by
daniel30
United Kingdom
daniel30 wrote:

I've recently developed a C++ library, Mappable, which efficiently handles intervals. It allows you to write stuff like this:

std::vector<MyMappableType> mappables {};
/* insert data */
const auto overlapped = overlap_range(mappables, some_interval); // returns an iterator range
for (const auto& mappable : overlapped) {
std::cout << mappable << std::endl;
}
std::cout << count_contained(mappables, some_interval) << std::endl;

One really nice thing about the library is that the algorithms are independent of the container used to store the data (like the C++ STL).

I've benchmarked against interval tree based methods and the performance advantage can be quite striking - both in memory usage and query time (see benchmark code on Github repo).

1

I would recommend providing a Python interface to it (as an extension library) as that would significantly raise its utility to many more people.

1

I would suggest usage of pybind11 to write such a python extension; this amazing library is really a good choice to expose fast C++ algorithms implementations in python in order to make them more popular as Istvan said.