Question

findOverlaps + split is slow

7.8 years ago

floris.barthel ▴ 50

I'm trying to overlap several hundred thousand breakpoint locations with cytobands. Because a breakpoint can span a cytoband junction and so hit more than one cytoband, I'm using findOverlaps and then want to use split to group the hits by breakpoint.

However, while searching for overlaps is extremely fast, splitting them up is tediously slow. Is there any way to make this faster?

> hits1 = findOverlaps(gr1, cbr, ignore.strand=TRUE)
> hits1
> Hits object with 871572 hits and 0 metadata columns:
>            queryHits subjectHits
>            <integer>   <integer>
>        [1]         1           1
>        [2]         2           1
>        [3]         3           1
>        [4]         4           1
>        [5]         5           2
>        ...       ...         ...
>   [871568]    871562         419
>   [871569]    871563         421
>   [871570]    871564         421
>   [871571]    871565         461
>   [871572]    871566         475
>   -------
>   queryLength: 871566
>   subjectLength: 862

This runs quickly enough. The next step, however, takes forever to finish:

> hits1 = split(hits1,queryHits(hits1)) ## This is extremely slow

Can I optimize this?
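
One direction that might help, sketched under the assumption that only the cytoband indices per breakpoint are needed rather than a list of Hits objects: split the plain integer vector of subject indices by the query indices. The vectors `query_hits` and `subject_hits` below are hypothetical stand-ins for `queryHits(hits1)` and `subjectHits(hits1)`, so the sketch runs in base R without the GenomicRanges objects. The last row is made up to show a breakpoint that spans a junction.

```r
# Hypothetical stand-ins for queryHits(hits1) and subjectHits(hits1).
# The final pair (5, 3) is invented to illustrate a junction-spanning breakpoint.
query_hits   <- c(1L, 2L, 3L, 4L, 5L, 5L)
subject_hits <- c(1L, 1L, 1L, 1L, 2L, 3L)

# Split the plain integer vector of subject indices by query index.
# The result is an ordinary named list with one integer vector per breakpoint.
by_breakpoint <- split(subject_hits, query_hits)

# Breakpoints with more than one hit overlap more than one cytoband.
spanning <- names(by_breakpoint)[lengths(by_breakpoint) > 1]
```

With the real objects the corresponding call would be `split(subjectHits(hits1), queryHits(hits1))`. Whether this is fast enough on all 871572 hits is not something this small sketch demonstrates.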

findOverlaps BioConductor split GenomicRanges • 1.6k views
