GenomicRanges for Python.
This library tries to be a thin but extremely useful wrapper around genomic data contained in pandas dataframes. It gives you the functionality of bedtools/bedops and GenomicRanges while still letting you use the enormous universe of Python data science libraries to manipulate and do computations on the data.
PyRanges also contains a run-length encoding library for extremely efficient arithmetic computation of scores associated with genomic intervals.
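To give a feel for why run-length encoding makes this kind of arithmetic cheap, here is a minimal pure-Python sketch. It is not the PyRanges implementation, and the helper names (`coverage_rle`, `rle_apply`) are made up for illustration. It assumes half-open intervals and pads the shorter vector with zeros before combining two of them.

```python
import operator


def _append(runs, values, run, value):
    """Append a run, merging it into the previous run if the value repeats."""
    if run <= 0:
        return
    if values and values[-1] == value:
        runs[-1] += run
    else:
        runs.append(run)
        values.append(value)


def coverage_rle(intervals):
    """Run-length encoded coverage of half-open (start, end) intervals."""
    # Record +1 at every start and -1 at every end, then sweep left to right.
    events = {}
    for start, end in intervals:
        events[start] = events.get(start, 0) + 1
        events[end] = events.get(end, 0) - 1
    runs, values = [], []
    position, depth = 0, 0
    for pos in sorted(events):
        if pos > position:
            _append(runs, values, pos - position, depth)
            position = pos
        depth += events[pos]
    return runs, values


def rle_apply(a, b, op):
    """Combine two run-length vectors elementwise without expanding them."""
    runs_a, vals_a = list(a[0]), list(a[1])
    runs_b, vals_b = list(b[0]), list(b[1])
    # Assumption of this sketch: the shorter vector is padded with zeros.
    len_a, len_b = sum(runs_a), sum(runs_b)
    if len_a < len_b:
        runs_a.append(len_b - len_a)
        vals_a.append(0)
    elif len_b < len_a:
        runs_b.append(len_a - len_b)
        vals_b.append(0)
    runs, values = [], []
    i = j = 0
    while i < len(runs_a) and j < len(runs_b):
        # Consume the overlap of the two current runs in one step.
        step = min(runs_a[i], runs_b[j])
        _append(runs, values, step, op(vals_a[i], vals_b[j]))
        runs_a[i] -= step
        runs_b[j] -= step
        if runs_a[i] == 0:
            i += 1
        if runs_b[j] == 0:
            j += 1
    return runs, values


ip = coverage_rle([(3, 6), (5, 8)])   # toy "ChIP" intervals
bg = coverage_rle([(4, 10)])          # toy "background" interval
diff = rle_apply(ip, bg, operator.sub)
print(ip)
print(diff)
```

The work is proportional to the number of runs rather than the number of bases, which is what makes whole-chromosome coverage arithmetic feasible.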
Repo: https://github.com/endrebak/pyranges
Docs: http://pyranges.readthedocs.io/
pip install pyranges # Try the examples in the docs, whydontcha
Most desired: feedback, bug reports and ideas. I do not need PRs yet, as the underlying code might change greatly.
>>> import pyranges as pr
>>> cs = pr.load_dataset("chipseq")
>>> cs
+--------------|-----------|-----------|--------|---------|----------+
| Chromosome | Start | End | Name | Score | Strand |
|--------------|-----------|-----------|--------|---------|----------|
| chr8 | 28510032 | 28510057 | U0 | 0 | - |
| chr7 | 107153363 | 107153388 | U0 | 0 | - |
| chr5 | 135821802 | 135821827 | U0 | 0 | - |
| ... | ... | ... | ... | ... | ... |
| chr6 | 89296757 | 89296782 | U0 | 0 | - |
| chr1 | 194245558 | 194245583 | U0 | 0 | + |
| chr8 | 57916061 | 57916086 | U0 | 0 | + |
+--------------|-----------|-----------|--------|---------|----------+
PyRanges object has 10000 sequences from 24 chromosomes.
>>> bg = pr.load_dataset("chipseq_background")
>>> cs.nearest(bg, suffix="_IP")
+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
| Chromosome | Start | End | Name | Score | Strand | Chromosome_IP | Start_IP | End_IP | Name_IP | Score_IP | Strand_IP | Distance |
|--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------|
| chr1 | 1325303 | 1325328 | U0 | 0 | - | chr1 | 1041102 | 1041127 | U0 | 0 | + | 284176 |
| chr1 | 1541598 | 1541623 | U0 | 0 | + | chr1 | 1770383 | 1770408 | U0 | 0 | - | 228760 |
| chr1 | 1599121 | 1599146 | U0 | 0 | + | chr1 | 1770383 | 1770408 | U0 | 0 | - | 171237 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chrY | 21910706 | 21910731 | U0 | 0 | - | chrY | 20557165 | 20557190 | U0 | 0 | + | 1353516 |
| chrY | 22054002 | 22054027 | U0 | 0 | - | chrY | 20557165 | 20557190 | U0 | 0 | + | 1496812 |
| chrY | 22210637 | 22210662 | U0 | 0 | - | chrY | 20557165 | 20557190 | U0 | 0 | + | 1653447 |
+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
PyRanges object has 10000 sequences from 24 chromosomes.
>>> cs.set_intersection(bg, strandedness="opposite")
+--------------|-----------|-----------|----------+
| Chromosome | Start | End | Strand |
|--------------|-----------|-----------|----------|
| chr1 | 226987603 | 226987617 | + |
| chr8 | 38747236 | 38747251 | - |
+--------------|-----------|-----------|----------+
PyRanges object has 2 sequences from 2 chromosomes.
>>> cv = cs.coverage(stranded=True)
>>> cv
chr1 +
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
| Runs | 1541598 | 25 | 57498 | 25 | 1904886 | ... | 25 | 2952580 | 25 | 1156833 | 25 |
|--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------|
| Values | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 |
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
| Runs | 7046809 | 25 | 358542 | 25 | 296582 | ... | 25 | 143271 | 25 | 156610 | 25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------|
| Values | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.
>>> cv + 10.42
chr1 +
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
| Runs | 1541598 | 25 | 57498 | 25 | 1904886 | ... | 25 | 2952580 | 25 | 1156833 | 25 |
|--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------|
| Values | 10.42 | 11.42 | 10.42 | 11.42 | 10.42 | ... | 11.42 | 10.42 | 11.42 | 10.42 | 11.42 |
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
| Runs | 7046809 | 25 | 358542 | 25 | 296582 | ... | 25 | 143271 | 25 | 156610 | 25 |
|--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------|
| Values | 10.42 | 11.42 | 10.42 | 11.42 | 10.42 | ... | 11.42 | 10.42 | 11.42 | 10.42 | 11.42 |
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.
>>> bg_cv = bg.coverage()
>>> cv - bg_cv
chr1
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
| Runs | 887771 | 25 | 106864 | 25 | 46417 | ... | 25 | 730068 | 25 | 259250 | 25 |
|--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------|
| Values | 0 | -1 | 0 | -1 | 0 | ... | 1 | 0 | -1 | 0 | 1 |
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
Rle of length 247134924 containing 3242 elements
...
chrY
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
| Runs | 7046809 | 25 | 147506 | 25 | 211011 | ... | 25 | 156610 | 25 | 35191552 | 25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------|
| Values | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | -1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
Rle of length 57402239 containing 60 elements
Unstranded PyRles object with 25 chromosomes.
Update: the PyRanges paper has been accepted in Bioinformatics. See https://doi.org/10.1093/bioinformatics/btz615
(Sorry for the bump. I wanted to add some examples, plus a better description.)
What are the CliffsNotes on how this differs from something like https://github.com/vsbuffalo/BioRanges?
BioRanges was never finished and I have seen no timings. PyRanges should reach feature parity with GenomicRanges soon. The greatest difference is perhaps that I try to make a dinky, convenient wrapper around pandas dataframes, so that all the good stuff from GenomicRanges can be used on them while numpy/scipy/pandas can still operate directly on the data.
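As a rough illustration of what "operating directly on the data" means, here is a sketch that uses plain pandas only, with no PyRanges calls. The column names follow the tables printed above and the three rows are copied from them. Any ordinary pandas operation works on a table of this shape.

```python
import pandas as pd

# Three intervals taken from the example output above.
df = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1", "chr8"],
        "Start": [1541598, 1599121, 57916061],
        "End": [1541623, 1599146, 57916086],
        "Strand": ["+", "+", "+"],
    }
)

# Ordinary vectorised pandas arithmetic on the interval columns.
df["Length"] = df["End"] - df["Start"]

# Ordinary pandas grouping: total bases covered per chromosome.
per_chrom = df.groupby("Chromosome")["Length"].sum()
print(per_chrom)
```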
Anyways, great q. Something I should update the docs/README with.