How To Get Non-Overlapping Coordinates From A List That Contains Overlapping Coordinates?
6
4
Entering edit mode
13.0 years ago
Ian 6.0k

I apologise in advance as this may turn out to be extremely simple to solve!

I have one file that contains 200bp coordinates centred upon subpeak summits from MACS. These being in BED3 format.

As i have extended the coordinates +/- 100bp and there are some instances where the new subpeak coordinates overlap. However i want to make them nonoverlapping with the same coverage. I want to extract sequence for the coordinates and i do not want redundant sequence.

Anyone know of a quick method to parse the coordinates and alter then so, for example, the first coordinate would be made to end at the point where it overlaps with another?

I hope this is clear?!

coordinates overlap • 12k views
ADD COMMENT
1
Entering edit mode

Ian, you may want to provide a sample of your data.

ADD REPLY
1
Entering edit mode

this is an interesting case. it's an "anti-merge". i think there should be a tool for this in bedtools. anyone have an apt verb to describe this? The best I can do is "unOverlap".

ADD REPLY
0
Entering edit mode

You want to filter some coordinates out? It seems that in most cases it would not be possible reduce to a non-overlapping set while maintaining the same coverage.

ADD REPLY
0
Entering edit mode

Do you need some coordinates to be simply dropped out? It may not be possible to reduce to a non-overlapping set while maintaining the same coverage.

ADD REPLY
0
Entering edit mode

How about 'underlap' :)

ADD REPLY
0
Entering edit mode

I like underlap very much.

ADD REPLY
5
Entering edit mode
13.0 years ago
Martin Morgan ★ 1.6k

With R / Bioconductor

library(GenomicRanges)
in = read.table("in.bed", header=FALSE)
gr0 = with(in, GRanges(V1, IRanges(V2, V3)))
gr1 = reduce(gr0)                  # this is where the work is done
out = as.data.frame(gr)[,1:3]
write.table(out, "out.bed", header=FALSE, sep="\t", quote=FALSE, 
            row.names=FALSE, col.names=FALSE)

Extending gr0 by 100 on both ends is resize(gr0, width=width(gr0) + 200, fix="center")

ADD COMMENT
0
Entering edit mode

Thanks for the R script, i do not code in R, but it is useful to see a practical working example!

ADD REPLY
3
Entering edit mode
13.0 years ago
lh3 33k

sort -k1,1 -k2,2n in.bed|awk '{if($1!=l||$2>e){if(l)print l,b,e;l=$1;b=$2;e=$3}else e=$3}END{print l,b,e}'

ADD COMMENT
0
Entering edit mode

Very interesting, thanks! Although i freely admit i do not know what is happening in the awk command.

ADD REPLY
2
Entering edit mode
13.0 years ago
brentp 24k

If I understand correctly, here is some python code that takes your sorted, original, unpadded BED3 and extends by pad basepairs unless it overlaps with the next interval.

just change the first line to the name your input BED3 and call like:

python pad.py > padded.bed

and save this into pad.py

#!/usr/bin/python
abed = "t.bed"
pad = 100

def gen_lines(bed):
    b = open(bed)
    for chrom, start, end in (l.rstrip().split("\t") for l in b):
        yield {'chrom': chrom, 'start': int(start),  'end': int(end)}

iterable = gen_lines(abed)

a = iterable.next()

for b in iterable:
    # same chr             # no overlap
    if a['chrom'] != b['chrom'] or b['start'] - pad > a['end'] + pad:
        a['start'] -= pad
        a['end']   += pad
    else:
        # overlap, adjust a and b
        dist = b['start'] - a['end']
        a['end'] += dist/2
        b['start'] -= dist/2

    a['start'] = max(0, a['start'])
    print "%(chrom)s\t%(start)d\t%(end)d" % a

    a = b
print "%(chrom)s\t%(start)d\t%(end)d" % b
ADD COMMENT
2
Entering edit mode
13.0 years ago

Ian, as far as I understand, you start with pre-padded BED coordinates like:

chr1    0       10
chr1    15      20
chr1    17      30
chr2    0       10

and want to convert it to:

chr1    0       10
chr1    15      20
chr1    20      30
chr2    0       10

where the start coordinate of the third line is readjusted to be non-overlapping.

Start by sorting your BED file with sortBed from bedtools. Then this Perl script will readjust the start coordinates based on the previous line, eliminating overlaps:

#!/usr/bin/env perl
## Parse a sorted BED3 file, avoiding overlapping coordinates
use strict;

my $infile = $ARGV[0];
chomp $infile;
open(INFILE, "<$infile");
open(OUTFILE, ">$infile.fixed");

my $last_space = "";
my $last_end = -1;
while (<INFILE>) {
  # prepare the current line
  chomp $_;
  my ($cur_space, $cur_start, $cur_end) = split("\t", $_);
  $cur_start = int($cur_start);
  $cur_end = int($cur_end);
  # if we are on the same chromosome and last end overlaps, reset our start
  if (($cur_space eq $last_space) && ($cur_start < $last_end)) {
    $cur_start = $last_end;
  }
  $cur_end >= $cur_start || die "Start greater than end";
  # output adjusted information and reset for next line
  printf OUTFILE ("%s\t%s\t%s\n", $cur_space, $cur_start, $cur_end);
  $last_space = $cur_space;
  $last_end = $cur_end;
}
ADD COMMENT
1
Entering edit mode

remember the ucsc interprets the ends in bed format to be non-inclusive so chr1 15 20 chr1 21 30 actually omits position 20

ADD REPLY
1
Entering edit mode

Thanks Jeremy. That's what I get for trying to program in Perl and think at the same time. I cleaned that up and fixed the example to be 0-based instead of 1-based.

ADD REPLY
0
Entering edit mode

Thanks for that. A nice simple solution AND in Perl :)

ADD REPLY
0
Entering edit mode

Good advice about BEDTools, but i just use:

sort -k1,1 -k2,2n file > file.sorted

ADD REPLY
0
Entering edit mode

nice and easy solution, but it must be pointed out that this script assumes that the BED file regions are sorted, which isn't necessarily always the case. you'll have to check that in advance.

ADD REPLY
1
Entering edit mode
13.0 years ago
Benm ▴ 710

@Brad Chapman: @brentp: @lan: I think you want to combine these overlap regions right? Ex, a more complicated example, named "test.txt":

chr1    1       10
chr1    15      20
chr1    17      30
chr1    18      19
chr2    1       10

and want to combine the overlap regions (filter the redundant) it to "result.txt":

chr1    1       10
chr1    15      30
chr2    1       10

I write a awk script for it:

BEGIN{
    i=0;
    c="";
    OFS="\t";
}
{
    if (c!=$1){
        if (c != ""){
            combine(chr,c,i);
        }
        i=0;
        c="";
    }
    c=$1;
    chr[i,0]=$2;
    chr[i,1]=$3;
    i++
}
END{
    combine(chr,c,i);
}
function combine(arr,c,i) {
    j=0;
    new[j,0]=arr[0,0];
    new[j,1]=arr[0,1];
    for (k=1;k<i;k++)
    {
        if ((arr[k,0]<=new[j,1])&&(arr[k,1]>=new[j,1])){
            new[j,1]=arr[k,1];
        }
        else if (arr[k,0]>new[j,1]){
            j++;
            new[j,0]=arr[k,0];
            new[j,1]=arr[k,1];
        }
    }
    for (n=0;n<=j;n++){
        print c,new[n,0],new[n,1]
    }
}

you just need to paste the above code into a combine.awk, then type command as below in your terminal and run:

awk -f combine.awk your test.txt > result.txt
ADD COMMENT
0
Entering edit mode

sorry but your naswer is unreadable, use formatting

ADD REPLY
0
Entering edit mode

I don't think this is an issue with my data, but looks very useful for other data sets.

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

Thanks for your reply, but i don't think either is applicable. There are plenty of solutions to both of the problem detailed in the links, but i don't see how they relate to me question.

ADD REPLY

Login before adding your answer.

Traffic: 1835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6