findOverlaps + split is slow
5.2 years ago

I'm trying to overlap several hundred thousand breakpoint locations with cytobands. Because I want to account for breakpoints spanning a cytoband junction, I'm using findOverlaps and want to use split to gather them by breakpoint.

However, while searching for overlaps is extremely fast, splitting them up is tediously slow. Is there any way to make this faster?

> hits1 = findOverlaps(gr1, cbr, ignore.strand=TRUE)
> hits1
> Hits object with 871572 hits and 0 metadata columns:
>            queryHits subjectHits
>            <integer>   <integer>
>        [1]         1           1
>        [2]         2           1
>        [3]         3           1
>        [4]         4           1
>        [5]         5           2
>        ...       ...         ...
>   [871568]    871562         419
>   [871569]    871563         421
>   [871570]    871564         421
>   [871571]    871565         461
>   [871572]    871566         475
>   -------
>   queryLength: 871566 / subjectLength: 862

This runs quickly enough. The next step, however, takes forever to finish:

> hits1 = split(hits1,queryHits(hits1)) ## This is extremely slow

Can I optimize this?
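
One thing that might be worth trying, offered as a sketch rather than a confirmed fix: split the plain integer vector of subject indices by the query indices, instead of splitting the Hits object itself. The example below uses small made-up vectors `q` and `s` as stand-ins for `queryHits(hits1)` and `subjectHits(hits1)`, so it runs in base R without any Bioconductor packages. The toy values are assumptions for illustration only, not data from the real overlap.

```r
# Hypothetical stand-ins for the two columns of the Hits object.
# With the real data these would be:
#   q <- queryHits(hits1)
#   s <- subjectHits(hits1)
q <- c(1L, 2L, 3L, 4L, 5L, 5L, 6L)  # breakpoint index; breakpoint 5 appears twice
s <- c(1L, 1L, 1L, 1L, 1L, 2L, 2L)  # cytoband index for each hit

# Split the integer vector of cytoband indices by breakpoint.
# The result is an ordinary list with one element per breakpoint.
by_breakpoint <- split(s, q)

# Breakpoints that hit more than one cytoband, i.e. span a junction.
spanning <- names(by_breakpoint)[lengths(by_breakpoint) > 1]
print(spanning)
```

With the Bioconductor packages loaded, `splitAsList(subjectHits(hits1), queryHits(hits1))` from IRanges should give the same grouping as a compressed list. I have not timed either variant on the full 871572 hits, so whether this is fast enough is an open question.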

findOverlaps BioConductor split GenomicRanges