Question: How Can I Merge Intervals ?
0
gravatar for Mchimich
6.5 years ago by
Mchimich250
France
Mchimich250 wrote:

Hello everybody, I should be grateful if you would kindly help me de fix my problem. I have a table like that :

Chromosome   start   end    info1    info2   
chr01        1       100    15       35         
chr01        150     300    15       39      
chr01        299     750    16       39

I would like to merge the intervals that overlap ( line 2 and 3) and those that are closest (line 1 and 2) in addition to perform some operation basing in the other column ! for example I would like to merge the tree line above into one interval (chr1 1-750), sum basing on the info1 (15+15+16) and finally did the mean basing on the info line to (35+39+39)/3 the output I'd like will be as this :

chr1 1-750  46  37.66

I know that Bedtools can merge interval ( galaxy tool ! too )but accept only BED format that contain only 3 coloumn chr start and end !
Thanks in advance for your help !

bedtools merge • 5.2k views
ADD COMMENTlink modified 6.5 years ago by JC7.4k • written 6.5 years ago by Mchimich250

How big is your file with this table? You want to merge overlaps first and then merge closest intervals? How do you define closest?

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Damian Kao15k

Hello DK thanks for responding ! my table contain thousands of lines containing different chromosmes and severals intervals by chromosomes. What I mean by closest intervals is intervals that are separated by a given distance (D). In this case D is equal to 50 but It can be more imporant value (100 or 200). cheers

ADD REPLYlink written 6.5 years ago by Mchimich250

It can be some value makes it ill-posed.

ADD REPLYlink written 6.5 years ago by Arun2.3k

Yeah, without the score it would be easier, e.g. Using IRanges, it is still possible using normalize to join the ranges and then get the scores.

ADD REPLYlink written 6.5 years ago by Michael Dondrup45k

I don't get why all ranges in your example should be merged into one, 2,3 ok, but there's a 50 bp gap between 1 and 2.

ADD REPLYlink written 6.5 years ago by Michael Dondrup45k

Hello Michael. It will take a lot of time if I want to explain the general context of my problem. In short I want to merge line 1 and line 2 because the gap is due to the fact that 1 and 2 are tow fragment of the same sequence (but diverge from my query sequence) that I want to consider as one !

ADD REPLYlink written 6.5 years ago by Mchimich250

Ok, but if we want to do the merging automatically, we need to have concrete criteria, overlap is one possibility, annotation could be another, or a combination. So you know a priori which regions to combine?

ADD REPLYlink written 6.5 years ago by Michael Dondrup45k

If I misspoke, I apologize, but my intention was to merge the tree line together ! this is possible with bedtools using Merge nearby (within a given distance bp) but I lose information of the other columns and I can not do aditional operation

ADD REPLYlink written 6.5 years ago by Mchimich250

What Michael is trying to tell is that no software will be able to find "nearby" regions by itself. So, you need a concrete criteria. For example, say your file is

chromosome,start,end,info1,info2
chr1, 1, 500,  20, 10
chr1, 501, 555, 10, 20
chr1, 600, 699, 25, 10
chr1, 680, 800, 45, 55

Then, we can merge 600-800 and then you'd expect that to be merged with 501-555, right? So, would it be possible to then "always" merge with the immediate entry above the overlapping region (assuming that the data is coordinate sorted in chr, start and end fashion?

Also, in the above example if the second line was 450,555 instead of 501,555, then how would you want to merge?

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Arun2.3k
2
gravatar for Damian Kao
6.5 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

So basically you just want to merge features that are within X distance of each other. I assume you will predetermine X somehow. I guess an easy way is to just artificially inflate your features by X base pairs on either side and check for overlaps.

I actually just posted something today on finding consensus/overlap sequences on my blog using lexical parsing: http://blog.nextgenetics.net/?e=41

Here is a modification of the code in my blog post. It will return overlapping by X distance groups of features:

#define your distance here
padding = 50

def sortCoord(d):
    coords = []
    i = 0
    for coord in d:
        coords.append(('s',coord[0] - padding,i))
        coords.append(('e',coord[1] + padding))
        i += 1
    coords.sort(key = lambda x : x[0], reverse = True)
    coords.sort(key = lambda x : x[1])
    return coords

def clusterByOverlap(c):
    count = 0
    posA = 0
    out = []
    currentData = []
    for pos in c:
        if count == 0:
            posA = pos[1]
        if pos[0] == 's':
            count += 1
            currentData.append(pos[2])
        if pos[0] == 'e':
            count -=1
        if count == 0:
            out.append((posA + padding, pos[1] - padding, currentData))
            currentData = []

    return out

import sys
inFile = open(sys.argv[1],'r')
inFile.next()
refs = {}
for line in inFile:
    data = line.strip().split('\t')
    if not data[0] in refs.keys():
        refs[data[0]] = []
    refs[data[0]].append([int(x) for x in data[1:]])

for ref, data in refs.items():
    sortedCoords = sortCoord(data)
    clusters = clusterByOverlap(sortedCoords)
    for cluster in clusters:
        info1 = []
        info2 = []
        for i in cluster[2]:
            info1.append(data[i][2])
            info2.append(data[i][3])
        print ref + "\t" + str(cluster[0]) + "\t" + str(cluster[1]) + "\t" + str(sum(info1)) + "\t" + str(sum(info2) / float(len(info2)))

I am assuming you have a tab-delimited input table. Save as yourName.py. Use by: python yourName.py yourFile

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Damian Kao15k
1
gravatar for JC
6.5 years ago by
JC7.4k
Mexico
JC7.4k wrote:

Assuming that you have your data sorted per chromosome then by initial position and you can define a fixed window overlap to combine regions separated by gaps (region 1 and 2 in your example have 50bp separation), you can use some code like this:

#!/usr/bin/perl

use strict;
use warnings;

my ($chr, $ini, $end, $val_1, $val_2, $sum, $ave);
my $last_chr = 'na';
my $last_ini = 'na';
my $last_end = 'na';
my $ext      =   50; # extension of the overlap
my @val_1;
my @val_2;

while (<>) {
    next if(m/^Chromsome/); # skip header
    chomp;
    ($chr, $ini, $end, $val_1, $val_2) = split (/\s+/, $_);
    if ($last_chr ne $chr) { # first iteration and each chromosome change
        unless ($last_chr eq 'na') {
            $sum = doSum(@val_1);
            $ave = doAve(@val_2);
            print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
        }
        $last_chr = $chr;
        $last_ini = $ini;
        $last_end = $end;
        @val_1    = ( $val_1 );
        @val_2    = ( $val_2 );
    }
    elsif ($last_ini - $ext <= $ini and $last_end + $ext <= $end and $last_end + $ext >= $ini) { # overlapped segments
        $last_end = $end;
        push @val_1, $val_1;
        push @val_2, $val_2;
    }
    else { # no overlap case
        $sum = doSum(@val_1);
        $ave = doAve(@val_2);
        print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";
        # reset variables
        $last_ini = $ini;
        $last_end = $end;
        @val_1    = ( $val_1 );
        @val_2    = ( $val_2 );
    }
}
# last one
$sum = doSum(@val_1);
$ave = doAve(@val_2);
print "$chr\t$last_ini-$last_end\t$sum\t$ave\n";

sub doSum {
    my $sum = 0;
    foreach my $x (@_) { 
        $sum += $x; 
    }
    return $sum;
}

sub doAve {
    my $sum = 0;
    my $n   = 0;
    foreach my $x (@_) { 
        $sum += $x;
        $n++;
    }
    return $sum / $n;
}

Example output with your sample ranges:

perl mergeData.pl < sample 
chr01    1-750    46    37.6666666666667
ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by JC7.4k
1
gravatar for Alex Reynolds
4.8 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

Here's a one-liner using core BEDOPS tools, taking advantage of standard input and output streams:

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --range 51 --merge inputs.bed \
    | bedmap --echo --sum --mean --delim '\t' - inputs.bed \
    | awk '{ print $1"\t"$2"-"$3"\t"$4"\t"$5 }' -
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Alex Reynolds27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 762 users visited in the last hour