How to iterate over rows of a pyranges object

I'd like to apply a function to each row of a pyranges object and, for each row, return that region together with another region (a so-called "hit") that meets some criterion, whatever that may be.

For example, I start with a pyranges object called A, read in from a BED file called A.bed:

A = pyranges.read_bed("A.bed")

I merge it into another object called M:

M = A.merge()

I'd like to run/apply a function on each row of M, which intersects back with A.

For instance, I want to get the maximum-scoring interval in A, among those intervals in A which fall in a merged row in M. (Intervals in A are disjoint.)
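
To make that concrete, here is a hypothetical toy input (the first two intervals in A are disjoint but book-ended, so bedops --merge would collapse them, and I assume pyranges' merge with its default slack does the same):

# A.bed
chr1    10    20    a1    5
chr1    20    30    a2    9
chr1    50    60    a3    2

# M = A.merge()
chr1    10    30
chr1    50    60

# Desired output: each merged interval paired with its max-scoring "hit" in A
chr1    10    30    chr1    20    30    a2    9
chr1    50    60    chr1    50    60    a3    2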

Here is one approach I tried, which is very slow:

#!/usr/bin/env python

import sys
import pyranges as pr
import pandas as pd
import numpy as np

in_fn = sys.argv[1]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()

def max_scoring_element_over_merged_elements(df):
    rows = []
    for i in range(len(df)):
        # Wrap this one merged row back up as a single-interval pyranges object
        pd_row = pd.DataFrame(df.iloc[i]).T.reset_index(drop=True)
        pr_row = pr.PyRanges(pd_row)
        # Intersect the full input against this one merged interval
        candidates = bed_data.intersect(pr_row)
        max_score = np.max(candidates.Score)
        hit = candidates[candidates.Score == max_score].head(1)  # grab one row, in case of ties
        # Rename on a local copy; hit.df builds a fresh dataframe on every
        # access, so assigning to hit.df.columns directly would be lost
        hit_df = hit.df.reset_index(drop=True)
        hit_df.columns = ["Hit{}".format(x) for x in hit_df.columns]
        rows.append(pd.concat([pd_row, hit_df], axis=1))
    # Concatenate once at the end; row-by-row DataFrame.append is deprecated and slow
    return pd.concat(rows, ignore_index=True)

res = bed_merged.apply(max_scoring_element_over_merged_elements)
print(res)

This works on small files, but on anything like a typical input, it takes a very long time to run.

Using bedmap/bedops, for instance, this takes a fraction of a second on the command line:

$ bedmap --echo --max-element <(bedops --merge A.bed) A.bed > answer.bed

But the Python script above pegs a CPU at 100% and had been running for more than ten minutes when I canceled it.

Is there a Pythonic/pyrange-ic/idiomatic way to efficiently do per-row operations with pyranges? Thanks!


I think I found a better way to do this operation:

#!/usr/bin/env python

import sys
import pyranges as pr

in_fn = sys.argv[1]
out_fn = sys.argv[2]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()
join_merged_to_all = bed_merged.join(bed_data)

def max_scoring_element(df):
    # For each merged interval (keyed by its coordinates), keep only the
    # highest-scoring joined row: sort by Score so the best row comes first,
    # then drop all but the first row per merged interval
    return df \
        .sort_values('Score', ascending=False) \
        .drop_duplicates(['Chromosome', 'Start', 'End', 'Strand'], keep='first') \
        .sort_index() \
        .reset_index(drop=True)

# res is a pyranges object
res = join_merged_to_all.apply(max_scoring_element)

# res.df is a pandas dataframe
res.df.to_csv(out_fn, sep='\t', index=False)

This is still a bit slower than bedmap/bedops, and more verbose and less maintainable, but being able to run these operations within Python without shelling out via subprocess may be an acceptable tradeoff.
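
A possibly faster variant replaces the two sorts with a single groupby. This is an untested sketch; it assumes the joined frame keeps the merged interval's coordinates in the Chromosome/Start/End columns and the element's score in Score:

df = join_merged_to_all.df
# Row index of the max-scoring element within each merged interval
idx = df.groupby(['Chromosome', 'Start', 'End'], observed=True)['Score'].idxmax()
res = pr.PyRanges(df.loc[idx].reset_index(drop=True))

(observed=True matters here because pyranges stores Chromosome as a categorical column.)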


Interesting problem, and thanks for sharing a solution. Which part of the original max_scoring_element_over_merged_elements() function do you think took the most time? Is it the candidates = bed_data.intersect(pr_row) call?


That intersect call probably makes the whole thing an O(n²) operation, as opposed to two O(n log n) sorts. Converting each dataframe row back into a small pyranges object likely adds a fair bit of runtime too, as does using append to build a new dataframe row by row.

The solution I posted has its own issues. Sorting twice is not great, and for statistical testing it would be useful to pick among tied-score elements uniformly at random, say, rather than always taking the first one (whatever that happens to be). Dropping duplicates probably also uses a lot of memory, since it keeps a second copy of the merged intervals along with the max-scoring-element associations. I'm not too familiar with pandas performance tricks, so any thoughts on how to improve this would be welcome.
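
For the random tie-breaking, something like the following might work; this is an untested sketch that assumes pandas >= 1.1 (for GroupBy.sample) and reuses the joined object from my answer above:

df = join_merged_to_all.df
scores = df.groupby(['Chromosome', 'Start', 'End'], observed=True)['Score']
# Keep every row that ties for the max score within its merged interval...
at_max = df[df['Score'] == scores.transform('max')]
# ...then pick one of the tied rows uniformly at random per merged interval
res = at_max.groupby(['Chromosome', 'Start', 'End'], observed=True).sample(n=1)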


First consider whether you really need to iterate over rows in a DataFrame at all. Iterating through pandas DataFrame objects is generally slow and defeats the whole purpose of using a DataFrame. It is an anti-pattern and something you should only do when you have exhausted every other option. It is better to look for a list comprehension, a vectorized solution, or the DataFrame.apply() method to process a DataFrame.

Pandas DataFrame loop using a list comprehension (Name, Promoted, and Grade are example columns):

result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]

Pandas DataFrame loop using DataFrame.apply():

result = df.apply(lambda row: row["Name"] + " , " + str(row["TotalMarks"]) + " , " + row["Grade"], axis=1)
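
A fully vectorized version of the apply() example avoids calling a Python function per row entirely (same example columns):

result = df["Name"] + " , " + df["TotalMarks"].astype(str) + " , " + df["Grade"]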
