Question: Fast Interval Intersection Methodologies
17
gravatar for Biostar User
9.5 years ago by
Biostar User1.0k
Biostar User1.0k wrote:

Most genomic annotations are specified as intervals along the genome.

Provide code examples in your programming language that demonstrate the use of fast interval querying.

genomics • 15k views
ADD COMMENTlink modified 21 months ago by daniel30 • written 9.5 years ago by Biostar User1.0k

anyone benchmarked a nested containment list against an interval tree? other than pygr are there any NCL's in the wild?

ADD REPLYlink written 9.3 years ago by brentp23k
16
gravatar for Istvan Albert
9.5 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

This code example generates 10,000 intervals then queries them for overlapping regions. Requires only the presence of Python.

The code below requires the either the installation of the bx python package or alternatively you may just download the quicksect.py module and place it next to the script itself:

from random import randint, seed

# if you can install bx python then uncomment the line below
#
# from bx.intervals.operations.quicksect import IntervalNode

# otherwise just download the quickset module as shown above 
# and place it in next to your program
#
from quicksect import IntervalNode

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
    "Generates random interval over a size and span"
    lo = randint(10000, SIZE)
    hi = lo + randint(1, SPAN)
    return (lo, hi)

def find(start, end, tree):
    "Returns a list with the overlapping intervals"
    out = []
    tree.intersect( start, end, lambda x: out.append(x) )
    return [ (x.start, x.end) for x in out ]

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# start the root at the first element
start, end = data[0]
tree = IntervalNode( start, end )

# build an interval tree from the rest of the data
for start, end in data[1:]:
    tree = tree.insert( start, end )

for start, end in query:
    overlap = find(start, end , tree)
    print '(%s, %s) -> %s' % (start, end, overlap)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36981, 36984), (36973, 36982), (36978, 36987)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12806, 12811), (12799, 12806), (12809, 12819)]
(30339, 30343) -> [(30336, 30346), (30335, 30345), (30340, 30341)]
ADD COMMENTlink modified 11 months ago by RamRS23k • written 9.5 years ago by Istvan Albert ♦♦ 81k
10
gravatar for Istvan Albert
9.5 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

This code example generates 10,000 intervals then queries them for overlapping regions. The installation requires a C compiler and Python.

This is a faster solution than the other example, one that now requires the full installation of the bx-python module. The data structure is implemented in C and is at least one order of magnitude faster than the quicksect.py module presented in another example.

from random import randint, seed
from bx.intervals.intersection import Intersecter, Interval

# the span of the generated intervals
SPAN = 10

# the size of the genome
SIZE = 5*10**4

# the number of intervals
N = 10**4

def generate(x):
    "Generates random interval over a size and span"
    lo = randint(10000, SIZE)
    hi = lo + randint(1, SPAN)
    return (lo, hi)

# use this to force both examples to generate the same data
seed(10)

# generate 10 thousand random intervals
data = map(generate, xrange(N))

# generate the intervals to query over
query = map(generate, xrange(10))

# create the interval tree
tree = Intersecter()

# build an interval tree from the rest of the data
for start, end in data:
    tree.add_interval( Interval(start, end) )

# perform the query
for start, end in query:
    overlap = tree.find(start, end)
    out = [ (x.start, x.end) for x in overlap ]
    print '(%s, %s) -> %s' % (start, end, out)

Produces the output:

(41901, 41903) -> [(41894, 41902)]
(36981, 36987) -> [(36973, 36982), (36978, 36987), (36981, 36984)]
(36338, 36339) -> [(36337, 36347)]
(32741, 32748) -> [(32738, 32742)]
(49864, 49872) -> [(49859, 49865)]
(21475, 21477) -> []
(29425, 29428) -> [(29418, 29426), (29419, 29426)]
(29590, 29599) -> [(29586, 29595), (29596, 29598)]
(12804, 12811) -> [(12799, 12806), (12806, 12811), (12809, 12819)]
(30339, 30343) -> [(30335, 30345), (30336, 30346), (30340, 30341)]
ADD COMMENTlink modified 11 months ago by RamRS23k • written 9.5 years ago by Istvan Albert ♦♦ 81k

This is awesome! Thanks for sharing that, Istvan.

ADD REPLYlink written 6.6 years ago by a.zielezinski8.7k
7
gravatar for Giovanni M Dall'Olio
9.5 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

An alternative is to use BED files to store data and BEDTools to calculate the intersection.

The BED format is a format used by UCSC and many other projects to generically store annotations on genomes. Sometimes, it is easier to use just plain BED files to store annotations, instead of complex databases or HDF5, and BEDTools make it also faster to access them. Moreover, with BEDFiles you have other advantages, as you can use custom tracks on ucsc or gbrowse and there are other tools that use BED as input.

Anyway, to solve the problem of intersection with BEDTools, you would do:

intersectBed -a segdups.bed -b exons.bed
ADD COMMENTlink modified 11 months ago by RamRS23k • written 9.5 years ago by Giovanni M Dall'Olio26k
0
gravatar for daniel
21 months ago by
daniel30
United Kingdom
daniel30 wrote:

I've recently developed a C++ library, Mappable, which efficiently handles intervals. It allows you to write stuff like this:

std::vector<MyMappableType> mappables {};
/* insert data */
const auto overlapped = overlap_range(mappables, some_interval); // returns an iterator range
for (const auto& mappable : overlapped) {
    std::cout << mappable << std::endl;
}
std::cout << count_contained(mappables, some_interval) << std::endl;

One really nice thing about the library is that the algorithms are independent of the container used to store the data (like the C++ STL).

I've benchmarked against interval tree based methods and the performance advantage can be quite striking - both in memory usage and query time (see benchmark code on Github repo).

ADD COMMENTlink written 21 months ago by daniel30
1

I would recommend providing a Python interface to it (as an extension library) as that would significantly raise its utility to many more people.

ADD REPLYlink written 21 months ago by Istvan Albert ♦♦ 81k
1

I would suggest usage of pybind11 to write such a python extension; this amazing library is really a good choice to expose fast C++ algorithms implementations in python in order to make them more popular as Istvan said.

ADD REPLYlink modified 21 months ago • written 21 months ago by edrezen720
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1040 users visited in the last hour