Question

How To Get Non-Overlapping Coordinates From A List That Contains Overlapping Coordinates?

4

Entering edit mode

13.0 years ago

Ian 6.0k

I apologise in advance as this may turn out to be extremely simple to solve!

I have one file that contains 200bp coordinates centred upon subpeak summits from MACS. These being in BED3 format.

As i have extended the coordinates +/- 100bp and there are some instances where the new subpeak coordinates overlap. However i want to make them nonoverlapping with the same coverage. I want to extract sequence for the coordinates and i do not want redundant sequence.

Anyone know of a quick method to parse the coordinates and alter then so, for example, the first coordinate would be made to end at the point where it overlaps with another?

I hope this is clear?!

coordinates overlap • 12k views

ADD COMMENT • link updated 13.0 years ago by Martin Morgan ★ 1.6k • written 13.0 years ago by Ian 6.0k

1

Entering edit mode

Ian, you may want to provide a sample of your data.

ADD REPLY • link 13.0 years ago by Aleksandr Levchuk 3.2k

1

Entering edit mode

this is an interesting case. it's an "anti-merge". i think there should be a tool for this in bedtools. anyone have an apt verb to describe this? The best I can do is "unOverlap".

ADD REPLY • link 13.0 years ago by Aaronquinlan 12k

0

Entering edit mode

You want to filter some coordinates out? It seems that in most cases it would not be possible reduce to a non-overlapping set while maintaining the same coverage.

ADD REPLY • link 13.0 years ago by Aleksandr Levchuk 3.2k

0

Entering edit mode

Do you need some coordinates to be simply dropped out? It may not be possible to reduce to a non-overlapping set while maintaining the same coverage.

ADD REPLY • link 13.0 years ago by Aleksandr Levchuk 3.2k

0

Entering edit mode

How about 'underlap' :)

ADD REPLY • link 13.0 years ago by Ian 6.0k

0

Entering edit mode

I like underlap very much.

ADD REPLY • link 12.8 years ago by Aaronquinlan 12k

score 5 · Answer 1 · 2011-04-27

5

Entering edit mode

13.0 years ago

Martin Morgan ★ 1.6k

With R / Bioconductor

library(GenomicRanges)
in = read.table("in.bed", header=FALSE)
gr0 = with(in, GRanges(V1, IRanges(V2, V3)))
gr1 = reduce(gr0)                  # this is where the work is done
out = as.data.frame(gr)[,1:3]
write.table(out, "out.bed", header=FALSE, sep="\t", quote=FALSE, 
            row.names=FALSE, col.names=FALSE)

Extending gr0 by 100 on both ends is resize(gr0, width=width(gr0) + 200, fix="center")

ADD COMMENT • link 13.0 years ago by Martin Morgan ★ 1.6k

0

Entering edit mode

Thanks for the R script, i do not code in R, but it is useful to see a practical working example!

ADD REPLY • link 13.0 years ago by Ian 6.0k

score 3 · Answer 2 · 2011-04-26

3

Entering edit mode

13.0 years ago

lh3 33k

sort -k1,1 -k2,2n in.bed|awk '{if($1!=l||$2>e){if(l)print l,b,e;l=$1;b=$2;e=$3}else e=$3}END{print l,b,e}'

ADD COMMENT • link 13.0 years ago by lh3 33k

0

Entering edit mode

Very interesting, thanks! Although i freely admit i do not know what is happening in the awk command.

ADD REPLY • link 13.0 years ago by Ian 6.0k

score 2 · Answer 3 · 2011-04-26

If I understand correctly, here is some python code that takes your sorted, original, unpadded BED3 and extends by pad basepairs unless it overlaps with the next interval.

just change the first line to the name your input BED3 and call like:

python pad.py > padded.bed

and save this into pad.py

#!/usr/bin/python
abed = "t.bed"
pad = 100

def gen_lines(bed):
    b = open(bed)
    for chrom, start, end in (l.rstrip().split("\t") for l in b):
        yield {'chrom': chrom, 'start': int(start),  'end': int(end)}

iterable = gen_lines(abed)

a = iterable.next()

for b in iterable:
    # same chr             # no overlap
    if a['chrom'] != b['chrom'] or b['start'] - pad > a['end'] + pad:
        a['start'] -= pad
        a['end']   += pad
    else:
        # overlap, adjust a and b
        dist = b['start'] - a['end']
        a['end'] += dist/2
        b['start'] -= dist/2

    a['start'] = max(0, a['start'])
    print "%(chrom)s\t%(start)d\t%(end)d" % a

    a = b
print "%(chrom)s\t%(start)d\t%(end)d" % b

score 2 · Answer 4 · 2011-04-26

2

Entering edit mode

13.0 years ago

Brad Chapman 9.7k

Ian, as far as I understand, you start with pre-padded BED coordinates like:

chr1    0       10
chr1    15      20
chr1    17      30
chr2    0       10

and want to convert it to:

chr1    0       10
chr1    15      20
chr1    20      30
chr2    0       10

where the start coordinate of the third line is readjusted to be non-overlapping.

Start by sorting your BED file with sortBed from bedtools. Then this Perl script will readjust the start coordinates based on the previous line, eliminating overlaps:

#!/usr/bin/env perl
## Parse a sorted BED3 file, avoiding overlapping coordinates
use strict;

my $infile = $ARGV[0];
chomp $infile;
open(INFILE, "<$infile");
open(OUTFILE, ">$infile.fixed");

my $last_space = "";
my $last_end = -1;
while (<INFILE>) {
  # prepare the current line
  chomp $_;
  my ($cur_space, $cur_start, $cur_end) = split("\t", $_);
  $cur_start = int($cur_start);
  $cur_end = int($cur_end);
  # if we are on the same chromosome and last end overlaps, reset our start
  if (($cur_space eq $last_space) && ($cur_start < $last_end)) {
    $cur_start = $last_end;
  }
  $cur_end >= $cur_start || die "Start greater than end";
  # output adjusted information and reset for next line
  printf OUTFILE ("%s\t%s\t%s\n", $cur_space, $cur_start, $cur_end);
  $last_space = $cur_space;
  $last_end = $cur_end;
}

ADD COMMENT • link 13.0 years ago by Brad Chapman 9.7k

1

Entering edit mode

remember the ucsc interprets the ends in bed format to be non-inclusive so chr1 15 20 chr1 21 30 actually omits position 20

ADD REPLY • link 13.0 years ago by Jeremy Leipzig 22k

1

Entering edit mode

Thanks Jeremy. That's what I get for trying to program in Perl and think at the same time. I cleaned that up and fixed the example to be 0-based instead of 1-based.

ADD REPLY • link 13.0 years ago by Brad Chapman 9.7k

0

Entering edit mode

Thanks for that. A nice simple solution AND in Perl :)

ADD REPLY • link 13.0 years ago by Ian 6.0k

0

Entering edit mode

Good advice about BEDTools, but i just use:

sort -k1,1 -k2,2n file > file.sorted

ADD REPLY • link 13.0 years ago by Ian 6.0k

0

Entering edit mode

nice and easy solution, but it must be pointed out that this script assumes that the BED file regions are sorted, which isn't necessarily always the case. you'll have to check that in advance.

ADD REPLY • link 13.0 years ago by Jorge Amigo 14k

score 1 · Answer 5 · 2011-04-26

@Brad Chapman: @brentp: @lan: I　ｔｈｉｎｋ　ｙｏｕ　ｗａｎｔ　ｔｏ　ｃｏｍｂｉｎｅ　ｔｈｅｓｅ　ｏｖｅｒｌａｐ　ｒｅｇｉｏｎｓ　ｒｉｇｈｔ？Ｅｘ，　ａ　ｍｏｒｅ　ｃｏｍｐｌｉｃａｔｅd　ｅｘａｍｐｌｅ, named "test.txt"：

chr1    1       10
chr1    15      20
chr1    17      30
chr1    18      19
chr2    1       10

and want to cｏｍｂｉｎｅ　ｔｈｅ　ｏｖｅｒｌａｐ　ｒｅｇｉｏｎｓ　（ｆｉｌｔｅｒ　ｔｈｅ　redundant） it to "result.txt":

chr1    1       10
chr1    15      30
chr2    1       10

I write a awk script for it:

BEGIN{
    i=0;
    c="";
    OFS="\t";
}
{
    if (c!=$1){
        if (c != ""){
            combine(chr,c,i);
        }
        i=0;
        c="";
    }
    c=$1;
    chr[i,0]=$2;
    chr[i,1]=$3;
    i++
}
END{
    combine(chr,c,i);
}
function combine(arr,c,i) {
    j=0;
    new[j,0]=arr[0,0];
    new[j,1]=arr[0,1];
    for (k=1;k<i;k++)
    {
        if ((arr[k,0]<=new[j,1])&&(arr[k,1]>=new[j,1])){
            new[j,1]=arr[k,1];
        }
        else if (arr[k,0]>new[j,1]){
            j++;
            new[j,0]=arr[k,0];
            new[j,1]=arr[k,1];
        }
    }
    for (n=0;n<=j;n++){
        print c,new[n,0],new[n,1]
    }
}

you just need to paste the above code into a combine.awk, then type command as below in your terminal and run:

awk -f combine.awk your test.txt > result.txt

Ram · Answer 6 · 2011-04-26

0

Entering edit mode

13.0 years ago

Jeremy Leipzig 22k

This sounds a lot like one of these interval-based programming challenges:

Quick Programming Challenge: Calculate Common And Unique Regions From A List Of Chromosome Segments
Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table Of Alignment Starts And Ends?

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.0 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Thanks for your reply, but i don't think either is applicable. There are plenty of solutions to both of the problem detailed in the links, but i don't see how they relate to me question.

ADD REPLY • link 13.0 years ago by Ian 6.0k