pythonic equivalent to reduce() in R GRanges - how to collapse ranged data?
1
2
Entering edit mode
6.8 years ago

I posted this question on stackoverflow, but it did not get a response. It is a bioinformatics query, so perhaps this is a better forum: Is there are pythonic equivalent to the R ranges operation below?

In R (albeit longwinded):

Here is a test data.frame

df <- data.frame(
  "CHR" = c(1,1,1,2,2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150,350,400,500,450)
  )

First I make GRanges object:

gr <- GenomicRanges::GRanges(
  seqnames = df$CHR,
  ranges = IRanges(start = df$START, end = df$STOP)
  )

Then I reduce the intervals to collapse into new granges object:

reduced <- reduce(gr)

Now append a new column to original dataframe which confirms which rows belong to the same contiguous 'chunk'.

subjectHits(findOverlaps(gr, reduced))

Output:

> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3

How do I do this in Python?

R python • 2.4k views
ADD COMMENT
1
Entering edit mode

in what structure is your data stored in python?

ADD REPLY
0
Entering edit mode

The data is stored as a CSV to disk. As a python newbie, I guess I would load as a pandas table, but I am open to suggestions.

ADD REPLY
4
Entering edit mode
5.7 years ago
endrebak ▴ 980

https://github.com/biocore-ntnu/pyranges

import pyranges as pr
chromosomes = [1] * 3 + [2] * 2
starts = [100, 200, 300, 100, 400]
ends = [150, 350, 400, 500, 450]
gr = pr.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends)
print(gr.cluster())
# +--------------+-----------+-----------+-----------+
# |   Chromosome |     Start |       End |   Cluster |
# |       (int8) |   (int32) |   (int32) |   (int64) |
# |--------------+-----------+-----------+-----------|
# |            1 |       100 |       150 |         1 |
# |            1 |       200 |       350 |         2 |
# |            1 |       300 |       400 |         2 |
# |            2 |       100 |       500 |         3 |
# |            2 |       400 |       450 |         3 |
# +--------------+-----------+-----------+-----------+

It will be out in 0.0.21. Thanks for the idea!

ADD COMMENT

Login before adding your answer.

Traffic: 1746 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6