How To Get Non-Overlapping Coordinates From A List That Contains Overlapping Coordinates?
6
4
Entering edit mode
10.5 years ago
Ian 5.8k

I apologise in advance as this may turn out to be extremely simple to solve!

I have one file that contains 200bp coordinates centred upon subpeak summits from MACS. These being in BED3 format.

As i have extended the coordinates +/- 100bp and there are some instances where the new subpeak coordinates overlap. However i want to make them nonoverlapping with the same coverage. I want to extract sequence for the coordinates and i do not want redundant sequence.

Anyone know of a quick method to parse the coordinates and alter then so, for example, the first coordinate would be made to end at the point where it overlaps with another?

I hope this is clear?!

coordinates overlap • 10k views
1
Entering edit mode

Ian, you may want to provide a sample of your data.

1
Entering edit mode

this is an interesting case. it's an "anti-merge". i think there should be a tool for this in bedtools. anyone have an apt verb to describe this? The best I can do is "unOverlap".

0
Entering edit mode

You want to filter some coordinates out? It seems that in most cases it would not be possible reduce to a non-overlapping set while maintaining the same coverage.

0
Entering edit mode

Do you need some coordinates to be simply dropped out? It may not be possible to reduce to a non-overlapping set while maintaining the same coverage.

0
Entering edit mode

0
Entering edit mode

I like underlap very much.

5
Entering edit mode
10.5 years ago
Martin Morgan ★ 1.6k

With R / Bioconductor

library(GenomicRanges)
gr0 = with(in, GRanges(V1, IRanges(V2, V3)))
gr1 = reduce(gr0)                  # this is where the work is done
out = as.data.frame(gr)[,1:3]
row.names=FALSE, col.names=FALSE)


Extending gr0 by 100 on both ends is resize(gr0, width=width(gr0) + 200, fix="center")

0
Entering edit mode

Thanks for the R script, i do not code in R, but it is useful to see a practical working example!

3
Entering edit mode
10.5 years ago
lh3 32k

sort -k1,1 -k2,2n in.bed|awk '{if($1!=l||$2>e){if(l)print l,b,e;l=$1;b=$2;e=$3}else e=$3}END{print l,b,e}'

0
Entering edit mode

Very interesting, thanks! Although i freely admit i do not know what is happening in the awk command.

2
Entering edit mode
10.5 years ago
brentp 23k

If I understand correctly, here is some python code that takes your sorted, original, unpadded BED3 and extends by pad basepairs unless it overlaps with the next interval.

just change the first line to the name your input BED3 and call like:

python pad.py > padded.bed


#!/usr/bin/python
abed = "t.bed"

def gen_lines(bed):
b = open(bed)
for chrom, start, end in (l.rstrip().split("\t") for l in b):
yield {'chrom': chrom, 'start': int(start),  'end': int(end)}

iterable = gen_lines(abed)

a = iterable.next()

for b in iterable:
# same chr             # no overlap
if a['chrom'] != b['chrom'] or b['start'] - pad > a['end'] + pad:
else:
# overlap, adjust a and b
dist = b['start'] - a['end']
a['end'] += dist/2
b['start'] -= dist/2

a['start'] = max(0, a['start'])
print "%(chrom)s\t%(start)d\t%(end)d" % a

a = b
print "%(chrom)s\t%(start)d\t%(end)d" % b

2
Entering edit mode
10.5 years ago

chr1    0       10
chr1    15      20
chr1    17      30
chr2    0       10


and want to convert it to:

chr1    0       10
chr1    15      20
chr1    20      30
chr2    0       10


where the start coordinate of the third line is readjusted to be non-overlapping.

Start by sorting your BED file with sortBed from bedtools. Then this Perl script will readjust the start coordinates based on the previous line, eliminating overlaps:

#!/usr/bin/env perl
## Parse a sorted BED3 file, avoiding overlapping coordinates
use strict;

my $infile =$ARGV[0];
chomp $infile; open(INFILE, "<$infile");
open(OUTFILE, ">$infile.fixed"); my$last_space = "";
my $last_end = -1; while (<INFILE>) { # prepare the current line chomp$_;
my ($cur_space,$cur_start, $cur_end) = split("\t",$_);
$cur_start = int($cur_start);
$cur_end = int($cur_end);
# if we are on the same chromosome and last end overlaps, reset our start
if (($cur_space eq$last_space) && ($cur_start <$last_end)) {
$cur_start =$last_end;
}
$cur_end >=$cur_start || die "Start greater than end";
# output adjusted information and reset for next line
printf OUTFILE ("%s\t%s\t%s\n", $cur_space,$cur_start, $cur_end);$last_space = $cur_space;$last_end = $cur_end; }  ADD COMMENT 1 Entering edit mode remember the ucsc interprets the ends in bed format to be non-inclusive so chr1 15 20 chr1 21 30 actually omits position 20 ADD REPLY 1 Entering edit mode Thanks Jeremy. That's what I get for trying to program in Perl and think at the same time. I cleaned that up and fixed the example to be 0-based instead of 1-based. ADD REPLY 0 Entering edit mode Thanks for that. A nice simple solution AND in Perl :) ADD REPLY 0 Entering edit mode Good advice about BEDTools, but i just use: sort -k1,1 -k2,2n file > file.sorted ADD REPLY 0 Entering edit mode nice and easy solution, but it must be pointed out that this script assumes that the BED file regions are sorted, which isn't necessarily always the case. you'll have to check that in advance. ADD REPLY 1 Entering edit mode 10.5 years ago Benm ▴ 710 @Brad Chapman: @brentp: @lan: I ｔｈｉｎｋ ｙｏｕ ｗａｎｔ ｔｏ ｃｏｍｂｉｎｅ ｔｈｅｓｅ ｏｖｅｒｌａｐ ｒｅｇｉｏｎｓ ｒｉｇｈｔ？ Ｅｘ， ａ ｍｏｒｅ ｃｏｍｐｌｉｃａｔｅd ｅｘａｍｐｌｅ, named "test.txt"： chr1 1 10 chr1 15 20 chr1 17 30 chr1 18 19 chr2 1 10  and want to cｏｍｂｉｎｅ ｔｈｅ ｏｖｅｒｌａｐ ｒｅｇｉｏｎｓ （ｆｉｌｔｅｒ ｔｈｅ redundant） it to "result.txt": chr1 1 10 chr1 15 30 chr2 1 10  I write a awk script for it: BEGIN{ i=0; c=""; OFS="\t"; } { if (c!=$1){
if (c != ""){
combine(chr,c,i);
}
i=0;
c="";
}
c=$1; chr[i,0]=$2;
chr[i,1]=\$3;
i++
}
END{
combine(chr,c,i);
}
function combine(arr,c,i) {
j=0;
new[j,0]=arr[0,0];
new[j,1]=arr[0,1];
for (k=1;k<i;k++)
{
if ((arr[k,0]<=new[j,1])&&(arr[k,1]>=new[j,1])){
new[j,1]=arr[k,1];
}
else if (arr[k,0]>new[j,1]){
j++;
new[j,0]=arr[k,0];
new[j,1]=arr[k,1];
}
}
for (n=0;n<=j;n++){
print c,new[n,0],new[n,1]
}
}


you just need to paste the above code into a combine.awk, then type command as below in your terminal and run:

awk -f combine.awk your test.txt > result.txt

0
Entering edit mode

0
Entering edit mode

I don't think this is an issue with my data, but looks very useful for other data sets.

0
Entering edit mode
10.5 years ago

This sounds a lot like one of these interval-based programming challenges:

0
Entering edit mode

Thanks for your reply, but i don't think either is applicable. There are plenty of solutions to both of the problem detailed in the links, but i don't see how they relate to me question.