Question: How Can I Merge Intervals ?
0
8.4 years ago by
Mchimich280
France
Mchimich280 wrote:

Hello everybody, I should be grateful if you would kindly help me de fix my problem. I have a table like that :

``````Chromosome   start   end    info1    info2
chr01        1       100    15       35
chr01        150     300    15       39
chr01        299     750    16       39
``````

I would like to merge the intervals that overlap ( line 2 and 3) and those that are closest (line 1 and 2) in addition to perform some operation basing in the other column ! for example I would like to merge the tree line above into one interval (chr1 1-750), sum basing on the info1 (15+15+16) and finally did the mean basing on the info line to (35+39+39)/3 the output I'd like will be as this :

``````chr1 1-750  46  37.66
``````

I know that Bedtools can merge interval ( galaxy tool ! too )but accept only BED format that contain only 3 coloumn chr start and end !

bedtools merge • 6.1k views
modified 8.4 years ago by JC12k • written 8.4 years ago by Mchimich280

How big is your file with this table? You want to merge overlaps first and then merge closest intervals? How do you define closest?

Hello DK thanks for responding ! my table contain thousands of lines containing different chromosmes and severals intervals by chromosomes. What I mean by closest intervals is intervals that are separated by a given distance (D). In this case D is equal to 50 but It can be more imporant value (100 or 200). cheers

It can be some value makes it ill-posed.

Yeah, without the score it would be easier, e.g. Using IRanges, it is still possible using normalize to join the ranges and then get the scores.

I don't get why all ranges in your example should be merged into one, 2,3 ok, but there's a 50 bp gap between 1 and 2.

Hello Michael. It will take a lot of time if I want to explain the general context of my problem. In short I want to merge line 1 and line 2 because the gap is due to the fact that 1 and 2 are tow fragment of the same sequence (but diverge from my query sequence) that I want to consider as one !

Ok, but if we want to do the merging automatically, we need to have concrete criteria, overlap is one possibility, annotation could be another, or a combination. So you know a priori which regions to combine?

If I misspoke, I apologize, but my intention was to merge the tree line together ! this is possible with bedtools using Merge nearby (within a given distance bp) but I lose information of the other columns and I can not do aditional operation

What Michael is trying to tell is that no software will be able to find "nearby" regions by itself. So, you need a concrete criteria. For example, say your file is

``````chromosome,start,end,info1,info2
chr1, 1, 500,  20, 10
chr1, 501, 555, 10, 20
chr1, 600, 699, 25, 10
chr1, 680, 800, 45, 55
``````

Then, we can merge 600-800 and then you'd expect that to be merged with 501-555, right? So, would it be possible to then "always" merge with the immediate entry above the overlapping region (assuming that the data is coordinate sorted in chr, start and end fashion?

Also, in the above example if the second line was 450,555 instead of 501,555, then how would you want to merge?

2
8.4 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

So basically you just want to merge features that are within X distance of each other. I assume you will predetermine X somehow. I guess an easy way is to just artificially inflate your features by X base pairs on either side and check for overlaps.

I actually just posted something today on finding consensus/overlap sequences on my blog using lexical parsing: http://blog.nextgenetics.net/?e=41

Here is a modification of the code in my blog post. It will return overlapping by X distance groups of features:

``````#define your distance here

def sortCoord(d):
coords = []
i = 0
for coord in d:
i += 1
coords.sort(key = lambda x : x[0], reverse = True)
coords.sort(key = lambda x : x[1])
return coords

def clusterByOverlap(c):
count = 0
posA = 0
out = []
currentData = []
for pos in c:
if count == 0:
posA = pos[1]
if pos[0] == 's':
count += 1
currentData.append(pos[2])
if pos[0] == 'e':
count -=1
if count == 0:
currentData = []

return out

import sys
inFile = open(sys.argv[1],'r')
inFile.next()
refs = {}
for line in inFile:
data = line.strip().split('\t')
if not data[0] in refs.keys():
refs[data[0]] = []
refs[data[0]].append([int(x) for x in data[1:]])

for ref, data in refs.items():
sortedCoords = sortCoord(data)
clusters = clusterByOverlap(sortedCoords)
for cluster in clusters:
info1 = []
info2 = []
for i in cluster[2]:
info1.append(data[i][2])
info2.append(data[i][3])
print ref + "\t" + str(cluster[0]) + "\t" + str(cluster[1]) + "\t" + str(sum(info1)) + "\t" + str(sum(info2) / float(len(info2)))
``````

I am assuming you have a tab-delimited input table. Save as yourName.py. Use by: python yourName.py yourFile

1
8.4 years ago by
JC12k
Mexico
JC12k wrote:

Assuming that you have your data sorted per chromosome then by initial position and you can define a fixed window overlap to combine regions separated by gaps (region 1 and 2 in your example have 50bp separation), you can use some code like this:

``````#!/usr/bin/perl

use strict;
use warnings;

my (\$chr, \$ini, \$end, \$val_1, \$val_2, \$sum, \$ave);
my \$last_chr = 'na';
my \$last_ini = 'na';
my \$last_end = 'na';
my \$ext      =   50; # extension of the overlap
my @val_1;
my @val_2;

while (<>) {
chomp;
(\$chr, \$ini, \$end, \$val_1, \$val_2) = split (/\s+/, \$_);
if (\$last_chr ne \$chr) { # first iteration and each chromosome change
unless (\$last_chr eq 'na') {
\$sum = doSum(@val_1);
\$ave = doAve(@val_2);
print "\$chr\t\$last_ini-\$last_end\t\$sum\t\$ave\n";
}
\$last_chr = \$chr;
\$last_ini = \$ini;
\$last_end = \$end;
@val_1    = ( \$val_1 );
@val_2    = ( \$val_2 );
}
elsif (\$last_ini - \$ext <= \$ini and \$last_end + \$ext <= \$end and \$last_end + \$ext >= \$ini) { # overlapped segments
\$last_end = \$end;
push @val_1, \$val_1;
push @val_2, \$val_2;
}
else { # no overlap case
\$sum = doSum(@val_1);
\$ave = doAve(@val_2);
print "\$chr\t\$last_ini-\$last_end\t\$sum\t\$ave\n";
# reset variables
\$last_ini = \$ini;
\$last_end = \$end;
@val_1    = ( \$val_1 );
@val_2    = ( \$val_2 );
}
}
# last one
\$sum = doSum(@val_1);
\$ave = doAve(@val_2);
print "\$chr\t\$last_ini-\$last_end\t\$sum\t\$ave\n";

sub doSum {
my \$sum = 0;
foreach my \$x (@_) {
\$sum += \$x;
}
return \$sum;
}

sub doAve {
my \$sum = 0;
my \$n   = 0;
foreach my \$x (@_) {
\$sum += \$x;
\$n++;
}
return \$sum / \$n;
}
``````

Example output with your sample ranges:

``````perl mergeData.pl < sample
chr01    1-750    46    37.6666666666667
``````
1
6.8 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Here's a one-liner using core BEDOPS tools, taking advantage of standard input and output streams:

``````\$ sort-bed inputs.unsorted.bed > inputs.bed
\$ bedops --range 51 --merge inputs.bed \
| bedmap --echo --sum --mean --delim '\t' - inputs.bed \
| awk '{ print \$1"\t"\$2"-"\$3"\t"\$4"\t"\$5 }' -``````