Question: Define unique intervals from a set of peaks
0
2.7 years ago by
fusion.slope200
fusion.slope200 wrote:

Hello,

i have called peaks from 6 individual bam files and now I would like to merge the bed files in a way in which i will have uniq intervals. (where each uniq interval is present in all the 6 files).

How is it possible to do that? Is there a way with bedtools? I read the tutorial but there is nothing about this. Any idea would be really appreciated.

Cheers

modified 2.7 years ago by Alex Reynolds29k • written 2.7 years ago by fusion.slope200

Unique would be a set element present in only one of N inputs. Non-unique would mean an element present or overlapping in two or more of N inputs. Which do you mean?

An element overlapping 6 out of 6 Inputs.

The simplest solution would be multiple iterations of BEDTools 'intersect' with each of your six BAMs (intersect bam1 and bam2, intersect that output with bam3, etc).

1

Consider the following example:

A intersects B, and B intersects C. However, A does not intersect C.

If you want intervals overlapping across all sets, you can do pairwise comparison tests between all the input sets, which does not scale well and requires storing intermediate files.

Or you can use the faster approach I describe, which uses sorted BED files with `bedops` and `bedmap`, which does not require pairwise comparisons or intermediate files.

2
2.7 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.

Say you have N files (your specific example uses 6 files, but this shows the generic solution for N files).

You appear to want all elements from the N files, which overlap at least one element from each of all other N-1 sets.

First, convert any non-BED files to sorted BED with the appropriate conversion tools, e.g.: `bam2bed` from BEDOPS:

``````\$ for inFn in `find . -name "*.bam"`; do echo \$inFn; outFn=\$inFn.bed; bam2bed < \$infn > \$outFn; done
``````

Second, label each BED file to note where it is coming from:

``````\$ for inFn in `find . -name "*.bam.bed"`; do echo \$inFn; outFn=\${inFn%%.*}.labeled.bed; prefix=\${inFn%%.*}; echo \${prefix} | cut -f1-3 \${inFn} - > \$outFn; done
``````

Set a shell variable with the number `N` set to your overlap threshold, e.g.:

``````\$ export N=6
``````

Then do the set operations on the labeled files with BEDOPS `bedops` and `bedmap`:

``````\$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
| bedmap --echo --echo-map-id-uniq --delim '\t' - \
| cut -f1-3,5 - \
| awk -vN=\${N} '{ n = split(\$4, a, ";"); if (n==N) { print \$0; } }' \
> intervals_overlapping_all_N_inputs.bed
``````

The file `intervals_overlapping_all_N_inputs.bed` contains a list of intervals that overlap all N sets.

This may be enough, but perhaps you want more detail. Now you can go back and filter all the converted BAM files for those elements:

``````\$ for inFn in `find . -name "*.bam.bed"`; do echo \$inFn; \$outFn=\${inFn%%.*}.filtered.bed; bedops --element-of -1 \$inFn intervals_overlapping_all_N_inputs.bed > \$outFn; done
``````

Each `*.filtered.bed` will contain BED-formatted reads specific to the parent BAM file, which overlap all N sets.

This looks like a lot of work, but this will run much faster than a pairwise approach with other tools.