Question

Blast Result: Sorting non overlapping hits and choosing the extremes for overlapping ones

0

Entering edit mode

2.2 years ago

Sasha ▴ 10

I have a huge Blast output of the results like this:

1.83155e-63 24679461 24680333

7.37596e-64 24679455 24680312

9.14344e-65 24679455 24680333

6.86568e-72 24679455 24680333

6.45089e-69 24679455 24680333

4.49086e-56 24679455 24680312

1.78896e-52 35167152 35167547

2.57611e-51 35167209 35167547

The first column is e-value, second one is start of the match and third one is the end. Now I need to only choose the ones that don't overlap and if some are overlapping I need to choose only extremes. For example:

The results would be:

7:25

30:50

Thank you for any help!

Blast Biopython Python • 958 views

ADD COMMENT • link updated 2.1 years ago by Wayne ★ 2.0k • written 2.2 years ago by Sasha ▴ 10

2

Entering edit mode

2.2 years ago

Wayne ★ 2.0k

This should provide what you've asked for. It could be a little less clunky with more refining / updating to use .format() or f-strings .

import pandas as pd
df = pd.read_csv("data.txt",sep=" ",names=["e-value","start","end"])
def range_extract(lst):
    'Yield 2-tuple ranges or 1-tuple single elements from list of increasing'
    'ints; interval making code modified from'  
    'https://www.rosettacode.org/wiki/Range_extraction#Python'
    lenlst = len(lst)
    i = 0
    while i< lenlst:
        low = lst[i]
        while i <lenlst-1 and lst[i]+1 == lst[i+1]: i +=1
        hi = lst[i]
        if hi - low >= 1:    #<---MAIN DIFFERENCE
            yield (low, hi)
        else:
            yield (low,)
        i += 1

def printr(ranges):
    print( '\n'.join( (('%i:%i' % r) if len(r) == 2 else '%i' % r)
                     for r in ranges ) )
def expand_all_ranges_to_each_position(the_min,the_max):
    '''
    Takes the minimum and the max position and returns a list of all the positions
    in between as well as both the boundaries.
    '''
    return list(range(the_min,the_max+1))
all_positions = []
for row in df.itertuples():
    all_positions.extend(expand_all_ranges_to_each_position(min(row.start,row.end),max(row.start,row.end)))
all_positions = sorted(set(all_positions)) #get unique individual positions sorted; `set()` insures unique positions
for lst in [all_positions]:
    #print(list(range_extract(lst)))
    printr(range_extract(sorted(lst)))

A Jupyter notebook with related code run with some of the supplied data & example is here.
Click here to launch that notebook in an active, temporary Jupyter session served via MyBinder.org with the environment already having Python and the module Pandas, for panel-type data/dataframe-handling installed and working.

ADD COMMENT • link 2.2 years ago by Wayne ★ 2.0k

score 1 · Accepted Answer · 2022-03-16

Used a little different approach, which also seems to be working.

input = [
    (24679461, 24680333),
    (24679455, 24680312),
    (24679455, 24680333),
    (24679455, 24680333),
    (24679455, 24680333),
    (24679455, 24680312),
    (24679464, 24680333),
    (24679452, 24680333),
    (24679464, 24680339),
    (24679461, 24680333),
    (24679461, 24680333),
    (24679449, 24680333),
    (24679461, 24680333),
    (24679464, 24680333),
    (24679452, 24680333),
    (35167152, 35167547),
    (35167209, 35167547),
]

delta = 100
output = {}

# 24679449 - 24680333
# 35167152 - 35167547

for r in input:
    start = r[0]
    end = r[1]

    # if already have a range with the same start, just extend if new range is longer
    if start in output:
        print(f"Extending range: start={start}, end={max(output[start], end)}")
        output[start] = max(output[start], end)
    else:
        merged = False

        # if we don't have a range starting at the same location, look for a range that is 'delta' distance away in either direction
        for i in range(start - delta, start + delta):
            if i in output:

                # if we found a range with lower starting position, extend it if the new one is longer
                if i < start:
                    print(f"Extending upper range: start={i}, end={max(output[i], end)}")
                    output[i] = max(output[i], end)
                else:
                    # otherwise if we found a range with higher starting position we need to replace it with the starting position of the new one
                    # and keep the "longer" end (and also delete the old range which has now been "consumed")
                    print(f"Extending lower and upper range: start={start}, end={max(output[i], end)}")
                    output[start] = max(output[i], end)
                    del output[i]

                merged = True
                break

        # if we haven't merged the new range with anything just save it
        if not merged:
            print(f"Adding new range: start={start}, end={end}")
            output[start] = end

print(output)