Question: how to merge overlapping intervals in a bedgraph file
0
gravatar for Fidel
4.1 years ago by
Fidel1.9k
Germany
Fidel1.9k wrote:

I have a simple problem that I think could be resolved with current software; however I can't find a solution.

I am using a program that produces a bedgraph with overlapping intervals for example:

chr1 0 100 2
chr1 50 150 1
chr1 100 200 3

I would like to obtain a bedgraph that computes the maximum (or any other operation) for the overlapping regions. For example, using the maximum the output should look like:

chr1 0 50 2
chr1 50 100 2
chr1 100 150 3
chr 1 150 200 3

I tried using unionbg from bedtools which is expected to do something very similar to what I want but it requires more than one file and, if the same file is given twice, the output is not what I am looking for. merge, from bedtools is not useful because it will merge all regions into a single one. 

I already wrote a simple script to solve the problem but only handles the case of two overlaps as in the previous example and I am under the impression that a faster and more powerful solution should be available.

sequence beedtools • 4.0k views
ADD COMMENTlink modified 4.1 years ago by morovatunc400 • written 4.1 years ago by Fidel1.9k
2
gravatar for Alex Reynolds
4.1 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

If it's okay to use BEDOPS, you could use a one-liner of bedops --partition, bedmap and awk:

$ bedops --partition regions.bed |\
  bedmap --echo --echo-map-id-uniq --delim '\t' - regions.bed |\
  awk '{
          n = split($4, a, ";");
          max = a[1];
          for(i = 2; i <= n; i++)
          {
              if (a[i] > max)
              {
                  max = a[i];
              }
          }
          print $1"\t"$2"\t"$3"\t"max;
      }' - > answer.bed

Make sure the input regions are tab-delimited and sorted with sort-bed.

This works by partitioning the input into disjoint regions however many overlaps there are (bedops), mapping the disjoint regions to the original input to get a string of overlapping ids (bedmap), and using awk to find the maximum value in that string. Then awk prints the disjoint region and its max id.

By using standard input/output streams and using a linear sweep for the line maximum, this should run fast and have low memory overhead.

Here's what things look like at each step of the pipeline.

First, we partition the input regions into disjoint elements:

$ bedops --partition regions.bed
chr1    0      50
chr1    50     100
chr1    100    150
chr1    150    200

You can throw as many overlaps into --partition as you like.

Then we map the partitioned set back to the original input regions, and get the list of overlapping ids:

$ bedops --partition regions.bed |\
  bedmap --echo --echo-map-id-uniq --delim '\t' - regions.bed
chr1    0      50     2
chr1    50     100    1;2
chr1    100    150    1;3
chr1    150    200    3

Then awk parses each of the id-strings in the fourth column to get the maximum value, printing the result:

$ bedops --partition regions.bed |\
  bedmap --echo --echo-map-id-uniq --delim '\t' - regions.bed |\
  awk '{ 
          n = split($4, a, ";");
          max = a[1];
          for(i = 2; i <= n; i++)
          {
              if (a[i] > max)
              {
                  max = a[i];
              }
          }
          print $1"\t"$2"\t"$3"\t"max;
       }' -
chr1    0      50     2
chr1    50     100    2
chr1    100    150    3
chr1    150    200    3
ADD COMMENTlink modified 20 days ago by RamRS25k • written 4.1 years ago by Alex Reynolds29k

Hi,I have tried this chunk of code, but when using bedgraphToBigWig, it's reported with an error of multiple lines of end before start, anyway to fix this?

ADD REPLYlink written 8 weeks ago by dliu2013040

No idea, but please post a separate question with sample input, which I can run locally, and I'll be happy to take a look.

ADD REPLYlink written 7 weeks ago by Alex Reynolds29k
1
gravatar for Kamil
4.1 years ago by
Kamil2.0k
Boston
Kamil2.0k wrote:

I believe you cannot solve this without writing your own algorithm. Here's my first attempt:

cat a.bed

chr1    0       100     2
chr1    50      150     1
chr1    100     200     3

(cut -f2 a.bed; cut -f3 a.bed) |\
    sort -nu |\
    perl -ne '
        BEGIN{$i=0} chomp;
        push @F, $_;
        if($i++){print "chr1\t$F[$i-2]\t$F[$i-1]\n"}
        ' > b.bed

bedtools intersect -a a.bed -b b.bed | sort -k2n -k3n -k4nr | perl -lane 'print unless $h{$F[0,1,2]}++'

Output:

chr1    0       50      2
chr1    50      100     2
chr1    100     150     3
chr1    150     200     3
ADD COMMENTlink modified 20 days ago by RamRS25k • written 4.1 years ago by Kamil2.0k

Thanks for your answer. But, precisely I want to avoid a custom solution. As I said I already have a custom solution in python that handles also different chromosome names:

import fileinput

prev = None
for line in fileinput.input():
    fields = line.strip().split('\t')
    if prev is None or prev[0] != fields[0]:
        prev = fields
        continue
    print "{}\t{}\t{}\t{}".format(fields[0],
                                  fields[1],
                                  prev[2],
                                  max(float(prev[3]),
                                      float(fields[3]))
                                  )
    prev = fields
ADD REPLYlink modified 20 days ago by RamRS25k • written 4.1 years ago by Fidel1.9k
0
gravatar for Damian Kao
4.1 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can try doing this:

  1. Use clusterBed to cluster overlapping intervals
  2. For each cluster get the elementary intervals. Basically just put all start,end coordinates into a list, make the list unique, then sort the list. So something like this:

    for cluster in clusters:
    coords = set()
    for line in cluster:
        cols = line.strip().split()
        ref = cols[0]
        start, end, cov = map(int, cols[1:])
        coords.add(start)
        coords.add(end)
    coords = list(coords)
    coords.sort()
    elementary_intervals = []
    for i in range(len(coords) - 1):
        elementary_intervals.append((coords[i],coords[i+1]))
    
  3. For each elementary interval, find overlaps with cluster members and find the maximum coverage (or whatever other operations). So something like this:

    for e_interval in e_intervals:
    covs = []
    for member in cluster:
        if overlap(e_interval,member):
            covs.append(memberCov)
    max(covs) or sum(covs) or whatever
    
ADD COMMENTlink modified 20 days ago by RamRS25k • written 4.1 years ago by Damian Kao15k

I don't think clusterBed is useful here. All bins are same size same distant and end up in the same cluster.

ADD REPLYlink written 4.1 years ago by Fidel1.9k

I see. I misunderstood what you wanted. I've edited the answer.

ADD REPLYlink written 4.1 years ago by Damian Kao15k
0
gravatar for morovatunc
4.1 years ago by
morovatunc400
Turkey
morovatunc400 wrote:

I am trying to do same thing for analyinz chipseq bed formatted data. If you can transform your bed graph to bed maybe you can get them that merge peaks function. though i am not %100 sure.

http://homer.salk.edu/homer/ngs/mergePeaks.html

ADD COMMENTlink modified 20 days ago by RamRS25k • written 4.1 years ago by morovatunc400
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1216 users visited in the last hour