How to iterate over rows of pyranges object
Entering edit mode
6 weeks ago

I'd like to apply a pyranges function on each row of a pyranges object. I then return a row that contains that region, and another region (a so-called "hit") which meets the criteria, whatever that is.

For example, I start with a pyranges object called A, read in from a BED file called A.bed:

A = pyranges.read_bed("A.bed")

I merge it to another object called M.

M = A.merge()

I'd like to run/apply a function on each row of M, which intersects back with A.

For instance, I want to get the maximum-scoring interval in A, among those intervals in A which fall in a merged row in M. (Intervals in A are disjoint.)

Here is one approach I tried, which is very slow:

#!/usr/bin/env python

import sys
import pyranges as pr
import pandas as pd
import numpy as np

in_fn = sys.argv[1]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()

def max_scoring_element_over_merged_elements(df):
    new_df = pd.DataFrame()
    for i in range(len(df)):
        pd_row = pd.DataFrame(df.iloc[i]).T.reset_index(drop=True)
        pr_row = pr.from_dict(pd_row)
        candidates = bed_data.intersect(pr_row)
        max_score = np.max(candidates.Score)
        hit = candidates[(candidates.Score == max_score)].head(1) # grab one row, in case of ties
        hit.df.columns = ["Hit{}".format(x) for x in hit.df.columns]
        pd_row = pd.concat([pd_row, hit.df], axis=1)
        new_df = new_df.append(pd_row)
    return new_df

res = bed_merged.apply(max_scoring_element_over_merged_elements)

This works on small files, but on anything like a typical input, this takes a very long time to run.

Using bedmap/bedops, for instance, this literally takes a fraction of a second to run on the command line:

$ bedmap --echo --max-element <(bedops --merge A.bed) A.bed > answer.bed

But the Python script above is using 100% CPU and takes at least longer than ten minutes (at which point I canceled it).

Is there a Pythonic/pyrange-ic/idiomatic way to efficiently do per-row operations with pyranges? Thanks!

pyranges • 327 views
Entering edit mode
5 weeks ago

I think I found a better way to do this operation:

#!/usr/bin/env python

import sys
import pyranges as pr

in_fn = sys.argv[1]
out_fn = sys.argv[2]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()
join_merged_to_all = bed_merged.join(bed_data)

def max_scoring_element(df):
    return df \
        .sort_values('Score', ascending=False) \
        .drop_duplicates(['Chromosome', 'Start', 'End', 'Strand'], keep='first') \
        .sort_index() \

# res is a pyranges object
res = join_merged_to_all.apply(max_scoring_element)

# res.df is a pandas dataframe
res.df.to_csv(out_fn, sep='\t', index=False)

This is still a bit slower than bedmap/bedops, and much more verbose/less maintainable, but being able to run ops within Python without using subprocess might be an acceptable tradeoff.

Entering edit mode

Interesting problem, and thanks for sharing a solution. What part of the above max_scoring_element_over_merged_elements() function do you think took the longest time for computation? Is it the candidates = bed_data.intersect(pr_row) part?

Entering edit mode

That intersect call probably makes it an n^2 operation at that point, as opposed to two nlogn sorts. Converting each dataframe row back to a small pyranges object is likely adding a fair bit of runtime, too, as well as using append to build a new dataframe row by row.

The solution I provide has its own issues. Sorting twice is not great, and for statistical testing, it would be useful to pick from tied-score elements at random with uniform probability, say, as opposed to picking the first element (whatever that happens to be). Dropping duplicates probably uses a ton of memory to keep a second copy of merged intervals, along with the max-scoring element association. I'm not too familiar with pandas and tricks to get the best performance, so any thoughts on how to improve this would be welcome.


Login before adding your answer.

Traffic: 990 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6