BED file: Find intervals that overlap a certain percentage and keep the longest one
2
1
Entering edit mode
2.9 years ago
biomonte ▴ 220

I have a BED file with some elements that are located in overlapped intervals. I added unique IDs, and length in the 4th and 5th columns, respectively.

cat sorted_intervals.bed
[....]
NC_007117.7   52869911  52875049    NC_007117.7.27   5138
NC_007117.7   52869911  52870819    NC_007117.7.28   908
NC_007117.7   52869929  52870807    NC_007117.7.29   878
NC_007117.7   52869932  52870798    NC_007117.7.30   866
NC_007117.7   52869932  52870795    NC_007117.7.31   863
NC_007117.7   52869932  52870780    NC_007117.7.32   848
NC_007117.7   52869938  52870804    NC_007117.7.33   866
NC_007117.7   52869956  52870795    NC_007117.7.34   839
NC_007117.7   52874159  52875088    NC_007117.7.35   929
NC_007117.7   52874159  52875088    NC_007117.7.36   929
NC_007117.7   52874159  52875088    NC_007117.7.37   929
NC_007117.7   52874159  52875088    NC_007117.7.38   929
NC_007117.7   52874159  52875082    NC_007117.7.39   923
NC_007117.7   52874159  52875088    NC_007117.7.40   929
NC_007117.7   52874159  52875079    NC_007117.7.41   920
NC_007117.7   52874159  52875088    NC_007117.7.42   929
NC_007117.7   52874159  52875088    NC_007117.7.43   929
NC_007117.7   52874162  52875085    NC_007117.7.44   923
NC_007117.7   52874192  52875052    NC_007117.7.45   860
NC_007117.7   52874192  52875079    NC_007117.7.46   887
NC_007117.7   52874192  52875052    NC_007117.7.47   860
[....]

My goal is to find which elements are located in overlapping genomic regions and then select the one that spans the longest length. I also want to keep those elements that do not overlap with any others. To do so, in a previous analysis, I used the following command:

while read LINE ; do grep -wE "$LINE" sorted_intervals.bed | sort -k5,5nr | head -1 ; done < <(bedtools merge -i sorted_intervals.bed -c 4 -o count,collapse | awk '{print $5}' | sed 's/,/|/g')

which successfully outputs the element NC_007117.7.27 for the above example.

Now I want to repeat the same analysis and set up a cutoff value (let's say 10%) for the overlapping regions. That is, if two elements have overlap >=10%, keep the longest one; but if two elements have overlap <10%, keep both of them. Same as before, I also would like to keep elements that do not overlap with any other elements.

I have found similar questions, such as:

Bed file: merging intervals that overlap a certain percentage

multiple bed - merge regions IF overlapping more than xx percent of size

However, I am still not very sure how to do it, because when I run the following bedops command it gives me three different overlaps for the above example instead of just one as in bedtools merge. I also tried the option --fraction-either and it produced the same results. I think I might be misunderstanding something.

bedmap --count --echo-map-range --echo-map-id --fraction-both 0.1 --delim '\t' sorted_intervals.bed | uniq | grep "NC_007117.7.27"
21  NC_007117.7 52869911    52875088    NC_007117.7.28;NC_007117.7.27;NC_007117.7.29;NC_007117.7.32;NC_007117.7.31;NC_007117.7.30;NC_007117.7.33;NC_007117.7.34;NC_007117.7.41;NC_007117.7.39;NC_007117.7.35;NC_007117.7.36;NC_007117.7.37;NC_007117.7.38;NC_007117.7.40;NC_007117.7.42;NC_007117.7.43;NC_007117.7.44;NC_007117.7.45;NC_007117.7.47;NC_007117.7.46
8   NC_007117.7 52869911    52875049    NC_007117.7.28;NC_007117.7.27;NC_007117.7.29;NC_007117.7.32;NC_007117.7.31;NC_007117.7.30;NC_007117.7.33;NC_007117.7.34
14  NC_007117.7 52869911    52875088    NC_007117.7.27;NC_007117.7.41;NC_007117.7.39;NC_007117.7.35;NC_007117.7.36;NC_007117.7.37;NC_007117.7.38;NC_007117.7.40;NC_007117.7.42;NC_007117.7.43;NC_007117.7.44;NC_007117.7.45;NC_007117.7.47;NC_007117.7.46      

Does anyone know what would be the best way to proceed?

Many thanks,

EDIT: I don't think I need the two elements to have a reciprocal overlapping percentage of 10%, just the smallest one among the two being compared. That is, if the smallest of the two elements has an overlap that is less than 10%, then both elements should be kept.

Bed bedtools bash bedops • 4.2k views
ADD COMMENT
2
Entering edit mode
2.9 years ago

Based on the sketch provided here:

Example

Here is a script that I used to evaluate overlaps two levels down, first looking for the longest interval in a merged region, and then looking at overlaps with intervals overlapping that longest interval.

This runs entirely within Python, no use of bedops or other command-line kits. This uses the ncls library, which has some Cython optimizations for fast interval overlap queries, along with PyRanges to do the first-pass merge operation.

Notes:

  1. This is not recursive; some more complicated overlap arrangements might need a recursive approach, depending on how your overlap criteria must be applied.

  2. This would need to be run on one chromosome's worth of data at a time, unless coordinates are first translated to an "absolute" coordinate scheme as described in a previous comment. The bedextract tool can be used to quickly split a BED file into separate chromosomes, e.g.,

     for chr in `bedextract --list-chr in.bed`; do bedextract ${chr} in.bed > in.${chr}.bed; done
    

Code:

#!/usr/bin/env python

import sys
import io
import click
import ncls
import pandas as pd
import pyranges as pr
import numpy as np
import collections

test_intervals_str = '''Chromosome  Start   End Id
chr1    50  80  F
chr1    100 120 D
chr1    100 140 C
chr1    120 200 A
chr1    199 260 B
'''

def df_from_intervals_str(str):
    '''
    In:
    str - (string) TSV-formatted intervals with header

    Out:
    df - (Pandas dataframe)
    '''
    intervals = io.StringIO(str)
    return pd.read_csv(intervals, sep='\t', header=0)

def ncl_from_df(df):
    '''
    In:
    df - (Pandas dataframe)

    Out:
    df_as_ncl - (NCLS object)
    '''
    return ncls.NCLS(df['Start'], df['End'], df.index)

def ncl_all_overlaps(a, b):
    '''
    In:
    a - (NCLS object) set A
    b - (Pandas dataframe) set B

    Out:
    (l_idxs, r_idxs) - (tuple) indices of set A and set B which overlap
    '''
    return a.all_overlaps_both(b['Start'].values, b['End'].values, b.index.values) 

def test_ncl_all_overlaps(a, b):
    '''
    In:
    a - (NCLS object) set A
    b - (Pandas dataframe) set B
    '''
    print(ncl_all_overlaps(a, b))

def ovr_idxs_to_map(ovr):
    '''
    In:
    ovr - (tuple) output from ncl_all_overlaps

    Out:
    m - (OrderedDict) mapping of overlaps by indices
    '''
    m = collections.OrderedDict()
    for l_idx, r_idx in zip(ovr[0], ovr[1]):
        if l_idx not in m:
            m[l_idx] = []
        m[l_idx].append(r_idx)
    return m

def search_candidates(m, a, b, t):
    '''
    In:
    m - (OrderedDict) mapping of overlaps by indices
    a - (Pandas dataframe) set A (merged regions)
    b - (Pandas dataframe) set B (input intervals)
    t - (float) threshold for overlap

    Accepted elements are written to standard output stream
    '''
    for ak, bv in m.items():
        if len(bv) == 1:
            # print disjoint element
            b_row = b.iloc[[bv[0]]]
            sys.stdout.write('{}\n'.format('\t'.join([str(x) for x in b_row.values.flatten().tolist()])))
        else:
            # get set A's merged region length
            a_row = a.iloc[[ak]]
            a_len = a_row['End'].item() - a_row['Start'].item()
            # iterate through set B to get the first, longest overlap
            mo = 0
            mi = -1
            mv = -1
            for i, v in enumerate(bv):
                b_row = b.iloc[[v]]
                co = (b_row['End'].item() - b_row['Start'].item()) / a_len
                if co > mo:
                    mo = co
                    mi = i
                    mv = v
            # if the longest element does not meet the overlap 
            # threshold, we skip to the next merged region
            if mo <= t:
                continue
            # otherwise, examine a list of candidates
            candidate_idxs = bv.copy()
            accepted_idxs = [mv]
            candidate_idxs.pop(mi)
            # mv is the index of the item in set B that we now test against the 
            # remaining elements in the candidate list
            pv = mv
            parent_df = b.iloc[[pv]]
            children_df_as_ncl = ncl_from_df(b.iloc[candidate_idxs, :])
            children_ovr_parent = ncl_all_overlaps(children_df_as_ncl, parent_df)
            children_ovr_parent_map = ovr_idxs_to_map(children_ovr_parent)
            # test overlaps of children with longest-overlapping parent
            p_row = b.iloc[[pv]]
            candidate_idxs_to_remove = []
            for ci, cv in enumerate(candidate_idxs):
                c_row = b.iloc[[cv]]
                c_len = c_row['End'].item() - c_row['Start'].item()
                # remove any candidates which do not overlap the parent -- these may 
                # have originally overlapped the merged regions we started with
                if cv not in children_ovr_parent_map[pv]:
                    candidate_idxs_to_remove.append(ci)
                    continue
                # measure overlap, per criteria in sketch
                if (c_row['Start'].item() < p_row['Start'].item()) and (c_row['End'].item() < p_row['End'].item()):
                    l = p_row['Start'].item()
                    r = c_row['End'].item()
                elif (c_row['Start'].item() < p_row['End'].item()) and (c_row['End'].item() > p_row['End'].item()):
                    l = c_row['Start'].item()
                    r = p_row['End'].item()
                else:
                    # child element is nested within the parent
                    candidate_idxs_to_remove.append(ci)
                    continue
                # calculate overlap, relative to child element
                o = (r - l) / c_len
                # if child element coverage is *less* than threshold, include it
                if o < t:
                    accepted_idxs.append(cv)
                # either way, we remove it from further consideration
                candidate_idxs_to_remove.append(ci)
            # make sure that we have no children left to test
            # if we have any candidate children left, something went wrong
            assert(len(candidate_idxs) == len(candidate_idxs_to_remove))
            # print accepted elements
            for acc_idx in accepted_idxs:
                acc_row = b.iloc[[acc_idx]]
                sys.stdout.write('{}\n'.format('\t'.join([str(x) for x in acc_row.values.flatten().tolist()])))

@click.command()
@click.option('--threshold', type=float, default=0.1, help='overlap threshold')
def main(threshold):
    # validate parameter
    assert((threshold > 0) and (threshold < 1))
    # import data
    df = df_from_intervals_str(test_intervals_str)
    # validate input -- only one chromosome at a time
    assert(len(df.Chromosome.unique()) == 1)
    df_as_ncl = ncl_from_df(df)
    gr = pr.from_string(test_intervals_str)
    mf = gr.merge().as_df()
    # convert column types to make compatible with ncls
    mf['Start'] = mf['Start'].astype(np.int64)
    mf['End'] = mf['End'].astype(np.int64)
    # associate intervals with merged regions by indices (analogous to bedmap)
    mf_ovr_df = ncl_all_overlaps(df_as_ncl, mf)
    mf_ovr_df_map = ovr_idxs_to_map(mf_ovr_df)
    # search through associations for those that meet overlap criteria
    search_candidates(mf_ovr_df_map, mf, df, threshold)

if __name__ == '__main__':
    main()

Some examples of running it with different thresholds:

$ ./biostars9470750.py --threshold=0.1
chr1    50  80  F
chr1    120 200 A
chr1    199 260 B
$ ./biostars9470750.py --threshold=0.5
chr1    50  80  F
$ ./biostars9470750.py --threshold=0.01
chr1    50  80  F
chr1    120 200 A
ADD COMMENT
0
Entering edit mode

Many thanks for all your effort, Alex. I really appreciate it and I think your script would be extremely useful for my work. However, I am not sure if I understand why element "A" (chr1:120-200) is not picked when the threshold is set to 0.5. Based on my understanding, I think that in this situation, element "A" should be picked regardless of the threshold because it constitutes the longest element (the "parent") in the overlap region (**but see note below), and the threshold should be applied to children overlaps of element "A" (but not to element "A" itself). Am I understanding it correctly? I am so sorry for the trouble.

**Note: It would be very convenient if a column with ranks could be used to solve ties when choosing the parent, as previously discussed. That is, element "A" would be the one picked unless there are two or more elements with the same length as "A" in the overlap region, in which case the one with the lowest rank number should be picked (or one at random should be picked in the case that two or more tied parent elements have the same ranks).

ADD REPLY
1
Entering edit mode
2.9 years ago

My goal is to find which elements are located in overlapping genomic regions and then select the one that spans the longest length... Now I want to repeat the same analysis and set up a cutoff value (let's say 10%) for the overlapping regions. That is, if two elements have overlap >=10%, keep the longest one; but if two elements have overlap <10%, keep both of them. Same as before, I also would like to keep elements that do not overlap with any other elements

Perhaps pipe overlap lengths to a script, and work with your custom logic there.

For instance, let's start with a simpler example input (in.bed, sorted per sort-bed):

chr1    50  80 .
chr1    100 110 .
chr1    105 115 .
chr1    110 140 .
chr1    120 160 .
chr1    120 200 .
chr1    130 160 .

This is just BED4; it doesn't matter for our purposes how many columns there are in the input regions. The process below ignores everything but the first three columns. We just need it to be sorted per sort-bed.

Next, merge the inputs, get overlap size values for each merged region, and pipe the results to a script called threshold.py (shown below):

$ bedops --merge in.bed \
    | bedmap --echo-map --echo-map-size --echo-ref-size - in.bed \
    | ./threshold.py --threshold=0.1 \
    | sort-bed - \
    > answer.bed

The --echo-map option in bedmap is not guaranteed to write sorted BED, and threshold.py may pick elements in different orders depending on thresholding, so we need to pipe to sort-bed before writing to answer.bed (if sorted output is desired).

Here's threshold.py:

#!/usr/bin/env python

import sys
import click
import numpy as np

@click.command()
@click.option('--threshold', type=float, default=0.1, help='overlap threshold')
def main(threshold):
    for l in sys.stdin:
        (maps_str, maps_lens_str, ref_len_str) = l.rstrip('\n').split('|')
        maps = [x for x in maps_str.split(';')]
        ref_len = int(ref_len_str)
        maps_lens = [int(x) for x in maps_lens_str.split(';')]
        overlaps = [float(x) / ref_len for x in maps_lens]
        count_above_threshold = sum([1 for o in overlaps if o >= threshold])
        if count_above_threshold == 1:
            sys.stdout.write('{}\n'.format(maps[0]))
        elif count_above_threshold > 2:
            idx_of_longest_element = np.argmax(maps_lens)
            sys.stdout.write('{}\n'.format(maps[idx_of_longest_element]))
        else:
            for idx in np.argwhere(np.array(overlaps) < threshold):
                sys.stdout.write('{}\n'.format(maps[idx[0]]))

if __name__ == '__main__':
    main()

Logic: The if block asks how many elements overlap a merged region:

  1. If only one element overlaps the merged region, it is a disjoint element (does not overlap any other elements) and is printed as-is.
  2. If more than two elements overlap the merged region by some threshold, which you can specify via --threshold=<frac>, then the longest overlapping element over the merged region is written to output.
  3. If only two elements overlap by less than the specified threshold, print out both.
ADD COMMENT
1
Entering edit mode

Many thanks for your detailed answer Alex - It really helped me a lot to sort out my thoughts!

I will mark it as the accepted answer because it solves exactly what I asked for. However, I think I have found one particular situation in which the script is not able to give the right answer. That is, when there is two longest elements that have the same length and they both have an overlap with each other above the threshold, like so:

chr1    50  80  .
chr1    100 200 .
chr1    150 250 .
chr1    240 290 .

In this situation, it seems that the script reports the first and the second elements, but what happens to the third? I think the script is not able to chose between the second or the third element as the longest because they are the same length. In order decide which of the two longest elements should be selected, I would like the script to use an extra column that contains a rank number that can go from 1 to 10 in each element (being the lower, the better). For example, in the following BED file the third element should be picked because it has a smaller rank number than the second element. So, the result would include the 1st and the 3rd elements but not the 2nd or the 4th. Does that make sense to you?

chr1    50  80  .  3
chr1    100 200 .  3
chr1    150 250 .  2
chr1    240 290 .  3

One issue about this is that the ranks I have are independent (they result from a previous BLAST analysis that I did). So, I think there is a small chance that the two (or more) longest elements can have the same rank. In such case, I guess just picking one of the longest elements at random (maybe the first one or the last one) would be fine.

Would that be something possible to implement?

I think so, but I don't really know how to modify your python script to do it. If you could have some time to help me with that, I would be very grateful. I am sorry for the inconvenience.

ADD REPLY
1
Entering edit mode

If you keep ranks in the fifth column, you could use them for processing ties with a modification to the if-logic like this, perhaps:

#!/usr/bin/env python

import sys
import click
import numpy as np

@click.command()
@click.option('--threshold', type=float, default=0.1, help='overlap threshold')
def main(threshold):
    for l in sys.stdin:
        (maps_str, maps_lens_str, ref_len_str) = l.rstrip('\n').split('|')
        maps = [x for x in maps_str.split(';')]
        ref_len = int(ref_len_str)
        maps_lens = [int(x) for x in maps_lens_str.split(';')]
        overlaps = [float(x) / ref_len for x in maps_lens]
        count_above_threshold = sum([1 for o in overlaps if o >= threshold])
        if count_above_threshold == 1:
            sys.stdout.write('{}\n'.format(maps[0]))
        elif count_above_threshold > 2:
            # extract indices of all longest elements
            idxs_of_longest_elements = np.argwhere(maps_lens == np.max(maps_lens)).flatten().tolist()
            # extract ranks from ties
            tied_ranks = np.array([float(maps[i].split('\t')[4]) for i in idxs_of_longest_elements])
            # get random index of smallest of tied ranks (lower rank is better)
            tiebreak = np.random.choice(np.flatnonzero(tied_ranks == np.min(tied_ranks)))
            sys.stdout.write('{}\n'.format(maps[idxs_of_longest_elements[tiebreak]]))
        else:
            for idx in np.argwhere(np.array(overlaps) < threshold):
                sys.stdout.write('{}\n'.format(maps[idx[0]]))

if __name__ == '__main__':
    main()

If I set the rank of chr1:150-250 to 3, for instance, this script will randomly pick between that and the previous element on multiple runs.

ADD REPLY
1
Entering edit mode

Many thanks for this Alex, people like you make Biostars a great place!

I have tried this modification and I have one last question - In the case there was an element (for example, chr1:90-120) overlapping the longest element that is not being picked due to poor rank (chr1:100-200) but with no overlap to the other long element that is picked (chr1:150-250) - What would happen? I am not sure if I am correct, but my logic tells me that it should be picked. However, I have tested this scenario, and the element chr1:90-120 was not picked. Am I missing something? I think those would be extreme cases anyway, but I like to consider all the possibilities.

chr1    50  80  .   3
chr1    90  120 .   3
chr1    100 200 .   3
chr1    150 250 .   2
chr1    240 290 .   3

Again, really sorry to make you lose time with this.

ADD REPLY
1
Entering edit mode

In the case above, the merged regions are chr1:50-80 and chr1:90-290.

We print chr1:50-80 on its own (it is disjoint).

For the second merged region, there are four elements which have more than 10% overlap with this merged region (all but the disjoint element).

From the selection logic, we look for the longest elements and pick one with the best rank, or select randomly from ties.

So to answer your question as to what happens, the script as it is would pick chr1:150-250. It doesn't consider mutual overlaps within the merged region, or how to deal with them.

To go further and decide between elements within the merged region that are mutually disjoint or overlapping, one possibility might be to do a first-pass on selection with the approaches already discussed. Then take the first-pass output and re-run it against the original intervals, in order to find those which do not overlap.

You could catch chr1:90-120 that way, since it does not overlap chr1:150-250. However, you could theoretically catch a lot of other elements that overlap chr1:90-120, as well, so you might need to run things through a second time to clean out new overlaps. This seems hacky to me, but for what you're doing, maybe a couple passes is workable.

If you need a more complicated ruleset, another option might be to map the elements file against itself, using --count and --echo-map to output the number of overlapping elements within a set, along with those elements.

An element will overlap itself and so all counts are minimally 1. If an element overlaps any others, that count is incremented accordingly:

$ bedmap --echo --count in.bed
chr1    50  80  .   3|1
chr1    90  120 .   3|2
chr1    100 200 .   2|3
chr1    150 250 .   2|3
chr1    240 290 .   3|2

You can go further and add --echo-map to get those elements which make up the overlap:

$ bedmap --echo --count --echo-map in.bed
chr1    50  80  .   3|1|chr1    50  80  .   3
chr1    90  120 .   3|2|chr1    90  120 .   3;chr1  100 200 .   2
chr1    100 200 .   2|3|chr1    90  120 .   3;chr1  100 200 .   2;chr1  150 250 .   2
chr1    150 250 .   2|3|chr1    100 200 .   2;chr1  150 250 .   2;chr1  240 290 .   3
chr1    240 290 .   3|2|chr1    150 250 .   2;chr1  240 290 .   3

At this point, you would modify the parsing steps and selection logic, using the separate count and overlap size result to add further criteria for which subset of elements you prioritize for evaluation and selection.

At this point, you might step away from streams and write things to discrete intermediate files, processing them together with a script to get the desired result.

There's not much out-of-the-box that I know of that will do this easily. I've played with PyRanges, but I don't think it will solve this directly, either. It has some good bedops-like overlap calculation functionality, but not much in the way of mapping or running functions on associations of subsets of elements, which appears to be what you're trying to do here. Still, I hope this was useful, anyway.

ADD REPLY
0
Entering edit mode

After revising your python script as carefully as possible, I realized that it is the total length of the merged region that is being taken into account to calculate the overlap percentage of each element. However, now I finally see that this is not exactly what I was aiming for (I am really sorry for not explaining everything clearly in the first place - but the truth is that without your explanations I could not have figured this out). What I wanted to find is, first, the longest element in the merged region (considering ranks mentioned above in case there are two or more longest elements), and then, if there is an element that overlaps the one being picked, check if it overlaps over the threshold. Please see my drawing:

Example

In this scenario, the merged region would be chr1:100-260 and I understand why the script would only pick element A because it is the longest one that is above the 10% threshold if we are accounting for the total length of the merged region (80/160 > 0.1). However, I would like to account for the length of the shorter sequences that overlap element A instead. So in the above scenario, element B should also be recovered because 1/61 < 0.1, but not element C because 20/40 > 0.1. Element D would be ignored because it doesn't overlap element A. This way I think would allow me recover not only the longest element in the merged region but also any tandem duplicated genes that are next to it (this is my final purpose).

ADD REPLY
1
Entering edit mode

It might be useful to look at interval tree packages that will put genomic elements into a tree-like structure.

To deal with multiple chromosomes, the intervals could either be translated from chr1, chr10, chr11, etc. to an absolute coordinate system, based on the cumulative sizes of chromosomes; or each chromosome of intervals could be handled separately, which could be useful for parallelization, if you have a lot of intervals.

The procedure would be a depth-first search through merged regions: https://en.wikipedia.org/wiki/Depth-first_search

After putting intervals into a tree, merge intervals. Iterate over each merged interval. Within each merged interval, test all overlapping intervals for having their own overlaps, and either apply the selection criteria (apply an overlap threshold, where overlap is based on parent-child overlap) when an interval has no children overlaps, or recursively test if those overlaps also have children (depth-first search).

There's a Python package here that has methods for building an interval tree, merging overlaps, and measuring overlaps: https://github.com/chaimleib/intervaltree

It's an interesting problem. I might take a stab at it this weekend if I can get some time.

ADD REPLY
0
Entering edit mode

Thanks a lot Alex, it sounds like a much more challenging problem than expected. I don't think I have the knowledge to solve it unfortunately. If you get some time, I will be looking forward to hear from you whenever you can (no hurries). I really appreciate all your help, and I am sorry to waste your time.

ADD REPLY
0
Entering edit mode

I looked into the intervaltree package, but it appears to have some problem with it finding overlapping elements, so I skipped it. The ncls package may end up being faster for whole-genome scale work, in any case.

ADD REPLY

Login before adding your answer.

Traffic: 1915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6