Tool: (pre-alpha) pyranges: performant, pythonic GenomicRanges
Written 5 months ago by endrebak:

GenomicRanges for Python.

This library tries to be a thin but extremely useful wrapper around genomic data contained in pandas dataframes. This gives you all the wonderful functionality of bedtools/bedops and/or GenomicRanges, while still letting you use the enormous universe of Python data science libraries to manipulate and do computations on the data.
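To make the "genomic data in a pandas dataframe" idea concrete, here is a minimal sketch that uses only pandas and numpy. The column names and the three rows are copied from the chipseq table shown in this post; the `Length` and `Midpoint` columns are hypothetical additions for illustration, and nothing here calls PyRanges itself.

```python
import numpy as np
import pandas as pd

# Three intervals copied from the chipseq example table in this post.
df = pd.DataFrame({
    "Chromosome": ["chr8", "chr7", "chr1"],
    "Start": [28510032, 107153363, 194245558],
    "End": [28510057, 107153388, 194245583],
    "Strand": ["-", "-", "+"],
})

# Ordinary vectorised pandas arithmetic on the interval columns.
df["Length"] = df["End"] - df["Start"]
df["Midpoint"] = (df["Start"] + df["End"]) // 2

# Any pandas/numpy tool works on the same frame, e.g. a group-by or a ufunc.
per_strand = df.groupby("Strand")["Length"].sum()
log_length = np.log2(df["Length"])

print(df)
print(per_strand)
```

Because the intervals are plain columns, nothing needs to be converted before handing the data to other Python libraries.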

PyRanges also contains a run-length encoding library for extremely efficient arithmetic computation of scores associated with genomic intervals.
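For readers new to run-length encoding, the following is a minimal numpy sketch of the idea, not the PyRanges implementation. The helper names `rle_encode` and `rle_add_scalar` are made up for this example. It stores a coverage vector as run lengths plus one value per run (the Runs and Values rows in the coverage output in this post), so adding a scalar only touches one number per run.

```python
import numpy as np

def rle_encode(values):
    """Compress a 1-D array into (runs, values): run lengths and one value per run."""
    values = np.asarray(values)
    if values.size == 0:
        return np.array([], dtype=int), values
    # Positions where the value differs from the previous position start a new run.
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    starts = np.concatenate(([0], change))
    runs = np.diff(np.concatenate((starts, [values.size])))
    return runs, values[starts]

def rle_add_scalar(runs, values, scalar):
    """Add a scalar to every position; the run lengths stay the same."""
    return runs, values + scalar

# A toy per-base coverage vector (hypothetical data).
coverage = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1])
runs, vals = rle_encode(coverage)
new_runs, new_vals = rle_add_scalar(runs, vals, 10.42)

print(runs, vals)
print(new_runs, new_vals)
```

The cost of the addition depends on the number of runs, not on the length of the chromosome, which is why this representation suits sparse coverage data.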

Repo: https://github.com/endrebak/pyranges

Docs: http://pyranges.readthedocs.io/

pip install pyranges # Try the examples in the docs, whydontcha

Most desired: feedback, bug reports and ideas. I do not need PRs yet, as the underlying code might change greatly.

>>> import pyranges as pr

>>> cs = pr.load_dataset("chipseq")

>>> cs

+--------------|-----------|-----------|--------|---------|----------+
| Chromosome   | Start     | End       | Name   | Score   | Strand   |
|--------------|-----------|-----------|--------|---------|----------|
| chr8         | 28510032  | 28510057  | U0     | 0       | -        |
| chr7         | 107153363 | 107153388 | U0     | 0       | -        |
| chr5         | 135821802 | 135821827 | U0     | 0       | -        |
| ...          | ...       | ...       | ...    | ...     | ...      |
| chr6         | 89296757  | 89296782  | U0     | 0       | -        |
| chr1         | 194245558 | 194245583 | U0     | 0       | +        |
| chr8         | 57916061  | 57916086  | U0     | 0       | +        |
+--------------|-----------|-----------|--------|---------|----------+
PyRanges object has 10000 sequences from 24 chromosomes.

>>> bg = pr.load_dataset("chipseq_background")

>>> cs.nearest(bg, suffix="_IP")

+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
| Chromosome   | Start    | End      | Name   | Score   | Strand   | Chromosome_IP   | Start_IP   | End_IP   | Name_IP   | Score_IP   | Strand_IP   | Distance   |
|--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------|
| chr1         | 1325303  | 1325328  | U0     | 0       | -        | chr1            | 1041102    | 1041127  | U0        | 0          | +           | 284176     |
| chr1         | 1541598  | 1541623  | U0     | 0       | +        | chr1            | 1770383    | 1770408  | U0        | 0          | -           | 228760     |
| chr1         | 1599121  | 1599146  | U0     | 0       | +        | chr1            | 1770383    | 1770408  | U0        | 0          | -           | 171237     |
| ...          | ...      | ...      | ...    | ...     | ...      | ...             | ...        | ...      | ...       | ...        | ...         | ...        |
| chrY         | 21910706 | 21910731 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1353516    |
| chrY         | 22054002 | 22054027 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1496812    |
| chrY         | 22210637 | 22210662 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1653447    |
+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
PyRanges object has 10000 sequences from 24 chromosomes.

>>> cs.set_intersection(bg, strandedness="opposite")

+--------------|-----------|-----------|----------+
| Chromosome   |     Start |       End | Strand   |
|--------------|-----------|-----------|----------|
| chr1         | 226987603 | 226987617 | +        |
| chr8         |  38747236 |  38747251 | -        |
+--------------|-----------|-----------|----------+
PyRanges object has 2 sequences from 2 chromosomes.

>>> cv = cs.coverage(stranded=True)
>>> cv

chr1 +
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
| Runs   |   1541598 |   25 |   57498 |   25 |   1904886 |  ...    |   25 |   2952580 |   25 |   1156833 |   25 |
|--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------|
| Values |         0 |    1 |       0 |    1 |         0 | ...     |    1 |         0 |    1 |         0 |    1 |
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
| Runs   |   7046809 |   25 |   358542 |   25 |   296582 |  ...    |   25 |   143271 |   25 |   156610 |   25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------|
| Values |         0 |    1 |        0 |    1 |        0 | ...     |    1 |        0 |    1 |        0 |    1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.

>>> cv + 10.42

chr1 +
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
| Runs   |   1541598 |    25 |   57498 |    25 |   1904886 |  ...    |    25 |   2952580 |    25 |   1156833 |    25 |
|--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------|
| Values |     10.42 | 11.42 |   10.42 | 11.42 |     10.42 | ...     | 11.42 |     10.42 | 11.42 |     10.42 | 11.42 |
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
| Runs   |   7046809 |    25 |   358542 |    25 |   296582 |  ...    |    25 |   143271 |    25 |   156610 |    25 |
|--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------|
| Values |     10.42 | 11.42 |    10.42 | 11.42 |    10.42 | ...     | 11.42 |    10.42 | 11.42 |    10.42 | 11.42 |
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.

>>> bg_cv = bg.coverage()

>>> cv - bg_cv
chr1
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
| Runs   |   887771 |   25 |   106864 |   25 |   46417 |  ...    |   25 |   730068 |   25 |   259250 |   25 |
|--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------|
| Values |        0 |   -1 |        0 |   -1 |       0 | ...     |    1 |        0 |   -1 |        0 |    1 |
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
Rle of length 247134924 containing 3242 elements
...
chrY
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
| Runs   |   7046809 |   25 |   147506 |   25 |   211011 |  ...    |   25 |   156610 |   25 |   35191552 |   25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------|
| Values |         0 |    1 |        0 |    1 |        0 | ...     |    1 |        0 |    1 |          0 |   -1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
Rle of length 57402239 containing 60 elements
Unstranded PyRles object with 25 chromosomes.
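Subtracting two run-length encoded vectors whose runs do not line up can be done without expanding either one. The sketch below is one way to do it in numpy and is not taken from the PyRanges source; `rle_sub` is a hypothetical helper, and treating positions past the end of the shorter vector as 0 is an assumption made for this example.

```python
import numpy as np

def rle_sub(runs_a, vals_a, runs_b, vals_b):
    """Compute a - b for two run-length encoded vectors, returning (runs, values)."""
    runs_a, vals_a = np.asarray(runs_a), np.asarray(vals_a)
    runs_b, vals_b = np.asarray(runs_b), np.asarray(vals_b)
    ends_a, ends_b = np.cumsum(runs_a), np.cumsum(runs_b)
    # Every position where either input ends a run is a breakpoint of the result.
    ends = np.union1d(ends_a, ends_b)
    starts = np.concatenate(([0], ends[:-1]))
    # Find the run of each input that covers the start of each segment.
    idx_a = np.searchsorted(ends_a, starts, side="right")
    idx_b = np.searchsorted(ends_b, starts, side="right")
    # Assumption: positions beyond the end of the shorter input count as 0.
    diff = np.append(vals_a, 0)[idx_a] - np.append(vals_b, 0)[idx_b]
    seg_runs = np.diff(np.concatenate(([0], ends)))
    # Merge neighbouring segments that ended up with the same value.
    keep = np.concatenate(([True], diff[1:] != diff[:-1]))
    group = np.cumsum(keep) - 1
    merged_runs = np.bincount(group, weights=seg_runs).astype(int)
    return merged_runs, diff[keep]

# Toy inputs (hypothetical): a has length 10, b has length 12.
runs, vals = rle_sub([3, 2, 5], [0, 1, 0], [4, 2, 6], [0, 1, 0])
print(runs, vals)
```

The work is proportional to the number of runs in the two inputs, so the result stays compact even for chromosome-length vectors.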

(Sorry for the bump. I wanted to add some examples, plus a better description.)


What are the cliff-notes in terms of how this differs from something like https://github.com/vsbuffalo/BioRanges ?

Written 5 months ago by jrj.healey


BioRanges was never finished and I have seen no timings for it. PyRanges should reach feature parity with GenomicRanges soon. The greatest difference is perhaps that I try to make a dinky, convenient wrapper around pandas dfs, so that all the good stuff from GenomicRanges can be used on dfs while numpy/scipy/pandas can still be used directly on the data.

Anyways, great q. Something I should update the docs/README with.

Written 5 months ago by endrebak