Question: Pyranges: obtaining sets of overlaps or overlap pairs
0
gravatar for Bosberg
2 days ago by
Bosberg30
Germany, MPI
Bosberg30 wrote:

pyranges functions like overlap and intersect are great for selecting subsets of ranges that overlap, but I need to keep track of _which_ ranges overlap. I can think of two ways to accomplish this (they both have analogies to GRanges, and I'd be happy if I could find something in pyranges that does either A or B ):

A) one approach would be analogous to GRangesLists; let's say I have the following two objects:

pr1: ||------|<--1-->|-----|<-2->||<---3--->|-----------|<--4-->|----------||

pr2: ||-------|<--------- 1 ------->|---|<-------- 2 ----->|---------<-3->-||

and pr2.List_like_overlap(pr1) would return a dict() of Pyranges objects:

>>> pr2.List_like_overlap(pr1)
["1"]
1
2
3
["2"]
3
4
["3"]
#empty

Or something like that (note that pr1.3 appears in the sets for _both_ pr2.1 and pr2.2.) The goal would then be to average over the pr1 entries 1, 2, and 3 and assign the result to pr2.1 (likewise for pr1.3,4-->pr2.2)

B) The alternative would be something like pr2.findOverlapPairs(pr1) --in this case I would simply get back pairs of integers (e.g. tuples, or whatever) telling me the pairs of indices from self and other that overlap:

>>> pr2.findOverlapPairs(pr1)
(1,1)
(1,2)
(1,3)
(2,3)
(2,4)

I could take it from there and just grab entries with the appropriate indices from either object. Perhaps this approach is less overhead

Or perhaps there's some other solution I'm not seeing, but I hope it's clear what I'm looking for. Can anyone suggest a function that behaves like either pr.List_like_overlap() or pr.findOverlapPairs()

Edited for clarity

ADD COMMENTlink modified 1 day ago by Alex Reynolds31k • written 2 days ago by Bosberg30
2
gravatar for Alex Reynolds
1 day ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

The goal would then be to average over the pr1 entries 1, 2, and 3 and assign the result to pr2.1 (likewise for pr1.3,4-->pr2.2)

Assuming sorted BED5 elements (ID in the fourth column, quantitative signal in the fifth column) and availability of bedmap:

$ bedmap --echo --echo-map-id-uniq --mean --delim '\t' pr1.bed pr2.bed > mean_pr2_signal_overlapping_pr1.bed
$ bedmap --echo --echo-map-id-uniq --mean --delim '\t' pr2.bed pr1.bed > mean_pr1_signal_overlapping_pr2.bed

These shell commands can be run via subprocess, if Python is required.

One could capture stdout from a subprocess shell to write tuples or to a dict, if needed in that form. Changing the delimiter from a tab to another reserved character may help with parsing the standard output stream.

Here's an example, using two input files pr1.bed and pr2.bed:

These two inputs are a rough abstraction of the genomic coordinates in the example files in your question.

The following script pyranges-mimic.py shows the use of subprocess to stream through the output of bedmap, writing overlap information to a Python dictionary:

Here's a sample run:

$ ./pyranges-mimic.py pr2.bed pr1.bed
{
  "pr2": {
    "1": {
      "pr1": {
        "average_signal": "133.333333",
        "elements": [
          "3",
          "1",
          "2"
        ]
      }
    },
    "2": {
      "pr1": {
        "average_signal": "175.000000",
        "elements": [
          "3",
          "4"
        ]
      }
    },
    "3": {}
  }
}

This is just one example of how to organize the output. You could modify this data structure to turn it into key-value tuples, however you'd need it for downstream work.

ADD COMMENTlink modified 7 hours ago • written 1 day ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1510 users visited in the last hour
_