3.5 years ago by
Seattle, WA USA
Let's look at the following BEDOPS-based solution, which is a modification of the approach authored here by Shane Neph:
$ bedops --everything file1.bed file2.bed ... fileN.bed \
| bedmap --echo-map --fraction-both 0.6 - \
| awk '(split($0, a, ";") > 1)' - \
| sed 's/\;/\n/g' - \
| sort-bed - \
| uniq - \
It might look a bit overwhelming, but it is really quite straightforward if we walk through this line by line.
- Take the multiset union of all elements from all N input files with
bedops --everything. Note that this is not the same as a
--merge operation, which instead makes contiguous regions from all overlapping input regions. Pipe this unioned set to
- For each element from all files, use
bedmap --echo-map to print any other elements from the multiset union that overlap that first element by 60% or more, using
--fraction-both 0.6. The result will be a semi-colon delimited string that contains one or more overlapping elements. Note that this result will always include the original element, because it overlaps itself by 100%. Pipe this result to the
awk command splits the string on the semi-colon delimiter and prints out the string if there are two or more overlapping elements. We do this filter step, because we are not interested in any element overlapping itself and only itself, but only multiple (2+) elements that meet the 60% or greater overlap criterion. We pipe this string to
sed statement replaces all semi-colons with newline characters. This creates BED-formatted output that we pipe to
sort-bed tool takes the overlapping elements and puts them in lexicographically-sorted order. We pipe this to
uniq to remove duplicate elements.
answer.bed is a sorted BED file that contains all unique elements from
fileN.bed, where there is mutual 60% or greater overlap.
Note that this is a highly efficient solution, compared with O(2^N) pairwise comparisons required by other approaches. Further, we make use of Unix standard input and output streaming (note the hyphen at the end of each command, denoting standard input) to avoid the cost of making unnecessary intermediate files; writing to disk is expensive. The only thing written to disk is the final result.