Question

find overlapping Chip-seq peaks and return outerboundries

0

Entering edit mode

5.4 years ago

m.wekking ▴ 10

I have .bed files with different peaks and I need to find the overlap, but when two peaks overlap I want the program to return the outer boundries of the two peaks, i.e.

$ cat sample1.bed
chr1   20    40
chr1   60    80
chr2   80    120

$ cat sample2.bed
chr1   17    35
chr1   75    90
chr2   70    100

I want as output

chr1   17    40
chr1   60    90
chr2   70    120

Is this possible using bedTools or any other tool (preferably not in R)?

ChIP-Seq Overlap • 2.3k views

ADD COMMENT • link updated 5.4 years ago by ATpoint 82k • written 5.4 years ago by m.wekking ▴ 10

score 2 · Answer 1 · 2018-12-07

Edit: 10.12.18 => Modified the code snippet to work with an arbitrary number of BED files:

This is the code to work with two files. It does the intersection outputting the entire input intervals of -a and -b in case of overlap, and then use awk to select the smaller of the two start coordinates and the larger of the two end coordinates. The awk part can for sure be written more elegantly.

bedtools intersect -a sample1.bed -b sample2.bed -wa -wb | \
  awk 'OFS="\t" {if ($2 < $5) print $1, $2, $3, $6; else print $1, $5, $3, $6}' | \
  awk 'OFS="\t" {if ($3 > $4) print $1, $2, $3; else print $1, $2, $4}'

To work with an arbitrary number of files, you basically do the same thing. First, you intersect the first two files with the above code and store the result to disk. Then you loop through all other BED files and apply the same code to intersect the intermediate result with the current BED file, storing the result again to disk. Repeat until all BED files have been processed. For this, all files in the current directory that are to be processed hve the .bedsuffix:

#!/bin/bash

## Code as in my first answer, just wrapped into a function:
function MyIntersect {
  bedtools intersect -a $1 -b $2 -wa -wb | \
    awk 'OFS="\t" {if ($2 < $5) print $1, $2, $3, $6; else print $1, $5, $3, $6}' | \
    awk 'OFS="\t" {if ($3 > $4) print $1, $2, $3; else print $1, $2, $4}'
}; export -f MyIntersect

## Write names of all BED files to a file:
ls *.bed > all_beds.txt

## Get total number of BED files in current directory:
TOTALNUMBER="$(cat all_beds.txt | wc -l | sed -e 's/^[ \t]*//')"

## Start with the first two files manually (head -n1 gets the first file, head -n2 | tail -n 1 the second one from all_beds.txt):
MyIntersect "$(head -n 1 all_beds.txt)" "$(head -n 2 all_beds.txt | tail -n 1)" > intermediate.txt

## and now do the same thing with this intermediate result from the command above until all other BED files have been used:
for i in `seq 3 $TOTALNUMBER`
  do
  MyIntersect intermediate.txt "$(head -n $i all_beds.txt | tail -n 1)" > intermediate_new.txt
  mv intermediate_new.txt intermediate.txt
  done

## Output is final.bed
mv intermediate.txt final.bed

score 0 · Answer 2 · 2018-12-07

Not sure about chr2 in your example, but you could do the following for 1 to N sorted BED files:

$ bedops --merge A.bed B.bed ... N.bed > merge.bed
$ bedops --everything A.bed B.bed ... N.bed > union.bed

Then use the merge and union sets to look for two or more element overlaps:

$ bedmap --count --echo --delim '\t' merge.bed union.bed | awk '($1 >= 2)' | cut -f2- > answer.bed

Replace 2 with N or any other threshold you'd like.

When you have some overlap criteria to meet between sets, pairing bedmap with awk can work well.