GenomicRanges for Python.
This library tries to be a thin but extremely useful wrapper around genomic data contained in pandas dataframes. It gives you the functionality of bedtools/bedops and GenomicRanges, while letting you use the enormous universe of Python data science libraries to manipulate and compute on the data.
PyRanges also contains a run-length encoding library for extremely efficient arithmetic computation of scores associated with genomic intervals.
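To give a feel for why run-length encoding makes this arithmetic cheap, here is a minimal pure-Python sketch. It is not the PyRanges implementation: `rle_encode` and `rle_add` are hypothetical helper names made up for illustration, and the sketch assumes both inputs cover the same total length.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse a sequence into (run_length, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def rle_add(a, b):
    """Add two run-length encodings run by run, without expanding them.

    Assumes a and b describe sequences of the same total length.
    """
    result = []
    i = j = 0
    rem_a = a[0][0] if a else 0  # positions left in the current run of a
    rem_b = b[0][0] if b else 0  # positions left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)  # stretch where both values stay constant
        total = a[i][1] + b[j][1]
        if result and result[-1][1] == total:
            # Same value as the previous run: extend it instead of adding a new one.
            result[-1] = (result[-1][0] + step, total)
        else:
            result.append((step, total))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][0] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][0] if j < len(b) else 0
    return result


a = rle_encode([0, 0, 1, 1, 1, 0])  # [(2, 0), (3, 1), (1, 0)]
b = [(4, 1), (2, 2)]
print(rle_add(a, b))
```

The work is proportional to the number of runs, not the number of positions, which is what matters when a chromosome has hundreds of millions of positions but only a few hundred runs.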
Repo: https://github.com/endrebak/pyranges
Docs: http://pyranges.readthedocs.io/
pip install pyranges # Try the examples in the docs, whydontcha
Most desired: feedback, bug reports and ideas. I do not need PRs yet, as the underlying code might change greatly.
>>> import pyranges as pr
>>> cs = pr.load_dataset("chipseq")
>>> cs
+--------------|-----------|-----------|--------|---------|----------+
| Chromosome   | Start     | End       | Name   | Score   | Strand   |
|--------------|-----------|-----------|--------|---------|----------|
| chr8         | 28510032  | 28510057  | U0     | 0       | -        |
| chr7         | 107153363 | 107153388 | U0     | 0       | -        |
| chr5         | 135821802 | 135821827 | U0     | 0       | -        |
| ...          | ...       | ...       | ...    | ...     | ...      |
| chr6         | 89296757  | 89296782  | U0     | 0       | -        |
| chr1         | 194245558 | 194245583 | U0     | 0       | +        |
| chr8         | 57916061  | 57916086  | U0     | 0       | +        |
+--------------|-----------|-----------|--------|---------|----------+
PyRanges object has 10000 sequences from 24 chromosomes.
>>> bg = pr.load_dataset("chipseq_background")
>>> cs.nearest(bg, suffix="_IP")
+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
| Chromosome   | Start    | End      | Name   | Score   | Strand   | Chromosome_IP   | Start_IP   | End_IP   | Name_IP   | Score_IP   | Strand_IP   | Distance   |
|--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------|
| chr1         | 1325303  | 1325328  | U0     | 0       | -        | chr1            | 1041102    | 1041127  | U0        | 0          | +           | 284176     |
| chr1         | 1541598  | 1541623  | U0     | 0       | +        | chr1            | 1770383    | 1770408  | U0        | 0          | -           | 228760     |
| chr1         | 1599121  | 1599146  | U0     | 0       | +        | chr1            | 1770383    | 1770408  | U0        | 0          | -           | 171237     |
| ...          | ...      | ...      | ...    | ...     | ...      | ...             | ...        | ...      | ...       | ...        | ...         | ...        |
| chrY         | 21910706 | 21910731 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1353516    |
| chrY         | 22054002 | 22054027 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1496812    |
| chrY         | 22210637 | 22210662 | U0     | 0       | -        | chrY            | 20557165   | 20557190 | U0        | 0          | +           | 1653447    |
+--------------|----------|----------|--------|---------|----------|-----------------|------------|----------|-----------|------------|-------------|------------+
PyRanges object has 10000 sequences from 24 chromosomes.
>>> cs.set_intersection(bg, strandedness="opposite")
+--------------|-----------|-----------|----------+
| Chromosome   |     Start |       End | Strand   |
|--------------|-----------|-----------|----------|
| chr1         | 226987603 | 226987617 | +        |
| chr8         |  38747236 |  38747251 | -        |
+--------------|-----------|-----------|----------+
PyRanges object has 2 sequences from 2 chromosomes.
>>> cv = cs.coverage(stranded=True)
>>> cv
chr1 +
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
| Runs   |   1541598 |   25 |   57498 |   25 |   1904886 |  ...    |   25 |   2952580 |   25 |   1156833 |   25 |
|--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------|
| Values |         0 |    1 |       0 |    1 |         0 | ...     |    1 |         0 |    1 |         0 |    1 |
+--------|-----------|------|---------|------|-----------|---------|------|-----------|------|-----------|------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
| Runs   |   7046809 |   25 |   358542 |   25 |   296582 |  ...    |   25 |   143271 |   25 |   156610 |   25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------|
| Values |         0 |    1 |        0 |    1 |        0 | ...     |    1 |        0 |    1 |        0 |    1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|----------|------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.
>>> cv + 10.42
chr1 +
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
| Runs   |   1541598 |    25 |   57498 |    25 |   1904886 |  ...    |    25 |   2952580 |    25 |   1156833 |    25 |
|--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------|
| Values |     10.42 | 11.42 |   10.42 | 11.42 |     10.42 | ...     | 11.42 |     10.42 | 11.42 |     10.42 | 11.42 |
+--------|-----------|-------|---------|-------|-----------|---------|-------|-----------|-------|-----------|-------+
Rle of length 247134924 containing 894 elements
...
chrY -
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
| Runs   |   7046809 |    25 |   358542 |    25 |   296582 |  ...    |    25 |   143271 |    25 |   156610 |    25 |
|--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------|
| Values |     10.42 | 11.42 |    10.42 | 11.42 |    10.42 | ...     | 11.42 |    10.42 | 11.42 |    10.42 | 11.42 |
+--------|-----------|-------|----------|-------|----------|---------|-------|----------|-------|----------|-------+
Rle of length 22210662 containing 32 elements
PyRles object with 48 chromosomes/strand pairs.
>>> bg_cv = bg.coverage()
>>> cv - bg_cv
chr1
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
| Runs   |   887771 |   25 |   106864 |   25 |   46417 |  ...    |   25 |   730068 |   25 |   259250 |   25 |
|--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------|
| Values |        0 |   -1 |        0 |   -1 |       0 | ...     |    1 |        0 |   -1 |        0 |    1 |
+--------|----------|------|----------|------|---------|---------|------|----------|------|----------|------+
Rle of length 247134924 containing 3242 elements
...
chrY
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
| Runs   |   7046809 |   25 |   147506 |   25 |   211011 |  ...    |   25 |   156610 |   25 |   35191552 |   25 |
|--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------|
| Values |         0 |    1 |        0 |    1 |        0 | ...     |    1 |        0 |    1 |          0 |   -1 |
+--------|-----------|------|----------|------|----------|---------|------|----------|------|------------|------+
Rle of length 57402239 containing 60 elements
Unstranded PyRles object with 25 chromosomes.
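For anyone wondering how to read the Runs/Values tables above: each run is a stretch of consecutive positions with the same coverage value. A rough sketch of how such runs can be derived from intervals follows. It is only an illustration, not how PyRanges does it internally; `coverage_runs` is a hypothetical name, and it assumes half-open [start, end) intervals on a single chromosome and strand.

```python
from collections import defaultdict


def coverage_runs(intervals):
    """Return (runs, values) describing coverage from position 0 to the last end.

    Assumes half-open [start, end) intervals on one chromosome/strand.
    """
    deltas = defaultdict(int)
    for start, end in intervals:
        deltas[start] += 1  # coverage goes up where an interval starts
        deltas[end] -= 1    # and down where it ends
    runs, values = [], []
    position, depth = 0, 0
    for point in sorted(deltas):
        if point > position:
            if values and values[-1] == depth:
                # Depth did not change: extend the previous run.
                runs[-1] += point - position
            else:
                runs.append(point - position)
                values.append(depth)
            position = point
        depth += deltas[point]
    return runs, values


# The first two chr1 + strand intervals from the chipseq example above.
runs, values = coverage_runs([(1541598, 1541623), (1599121, 1599146)])
print(runs)    # [1541598, 25, 57498, 25]
print(values)  # [0, 1, 0, 1]
```

These are the same leading runs and values as in the `chr1 +` table printed by `cs.coverage(stranded=True)`.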
Update: PyRanges has been accepted in Bioinformatics. See https://doi.org/10.1093/bioinformatics/btz615
(Sorry for the bump. I wanted to add some examples, plus a better description.)
What are the cliff-notes in terms of how this differs from something like https://github.com/vsbuffalo/BioRanges ?
BioRanges was never finished and I have seen no timings. PyRanges seems to be reaching feature parity with GenomicRanges soon. The greatest difference is perhaps that I try to make a dinky, convenient wrapper around pandas dfs, so that all the good stuff from GenomicRanges can be used on dfs while numpy/scipy/pandas can still operate directly on the data.
Anyways, great q. Something I should update the docs/README with.