Pyranges: obtaining sets of overlaps or overlap pairs
2
0
Entering edit mode
15 months ago
Bosberg ▴ 50

pyranges functions like overlap and intersect are great for selecting subsets of ranges that overlap, but I need to keep track of _which_ ranges overlap. I can think of two ways to accomplish this (they both have analogies to GRanges, and I'd be happy if I could find something in pyranges that does either A or B ):

A) one approach would be analogous to GRangesLists; let's say I have the following two objects:

pr1: ||------|<--1-->|-----|<-2->||<---3--->|-----------|<--4-->|----------||

pr2: ||-------|<--------- 1 ------->|---|<-------- 2 ----->|---------<-3->-||

and pr2.List_like_overlap(pr1) would return a dict() of Pyranges objects:

>>> pr2.List_like_overlap(pr1)
["1"]
1
2
3
["2"]
3
4
["3"]
#empty

Or something like that (note that pr1.3 appears in the sets for _both_ pr2.1 and pr2.2.) The goal would then be to average over the pr1 entries 1, 2, and 3 and assign the result to pr2.1 (likewise for pr1.3,4-->pr2.2)

B) The alternative would be something like pr2.findOverlapPairs(pr1) --in this case I would simply get back pairs of integers (e.g. tuples, or whatever) telling me the pairs of indices from self and other that overlap:

>>> pr2.findOverlapPairs(pr1)
(1,1)
(1,2)
(1,3)
(2,3)
(2,4)

I could take it from there and just grab entries with the appropriate indices from either object. Perhaps this approach is less overhead

Or perhaps there's some other solution I'm not seeing, but I hope it's clear what I'm looking for. Can anyone suggest a function that behaves like either pr.List_like_overlap() or pr.findOverlapPairs()

Edited for clarity

pyranges overlaps overlap pairs List overlaps • 601 views
3
Entering edit mode
15 months ago

The goal would then be to average over the pr1 entries 1, 2, and 3 and assign the result to pr2.1 (likewise for pr1.3,4-->pr2.2)

Assuming sorted BED5 elements (ID in the fourth column, quantitative signal in the fifth column) and availability of bedmap:

$bedmap --echo --echo-map-id-uniq --mean --delim '\t' pr1.bed pr2.bed > mean_pr2_signal_overlapping_pr1.bed$ bedmap --echo --echo-map-id-uniq --mean --delim '\t' pr2.bed pr1.bed > mean_pr1_signal_overlapping_pr2.bed

These shell commands can be run via subprocess, if Python is required.

One could capture stdout from a subprocess shell to write tuples or to a dict, if needed in that form. Changing the delimiter from a tab to another reserved character may help with parsing the standard output stream.

Here's an example, using two input files pr1.bed and pr2.bed:

These two inputs are a rough abstraction of the genomic coordinates in the example files in your question.

The following script pyranges-mimic.py shows the use of subprocess to stream through the output of bedmap, writing overlap information to a Python dictionary:

Here's a sample run:

\$ ./pyranges-mimic.py pr2.bed pr1.bed
{
"pr2": {
"1": {
"pr1": {
"average_signal": "133.333333",
"elements": [
"3",
"1",
"2"
]
}
},
"2": {
"pr1": {
"average_signal": "175.000000",
"elements": [
"3",
"4"
]
}
},
"3": {}
}
}

This is just one example of how to organize the output. You could modify this data structure to turn it into key-value tuples, however you'd need it for downstream work.

0
Entering edit mode
14 months ago
Bosberg ▴ 50

Original developper recommends .join .

that is: pr1.join( pr2 )

and then one can index the resulting df by pr1 objects...