Is there an easy way to use bedtools (or any other software) to get the features of the intersection of multiple bed files, for features/intervals that occur in at least 2 bed files?
Is there an easy way to use bedtools (or any other software) to get the features of the intersection of multiple bed files, for features/intervals that occur in at least 2 bed files?
Here is a general approach with BEDOPS bedmap --count
, which generalizes to N internally-disjoint (and sort-bed
-sorted) input files:
$ bedops --everything 1.bed 2.bed 3.bed ... N.bed \
| bedmap --count --echo --delim '\t' - \
| uniq \
| awk -v OVERLAPS=2 '$1 >= OVERLAPS' \
| cut -f2- \
> common.bed
The bedmap
step uses, by default, overlap of one or more bases for inclusion. You can modify this threshold to be more stringent. The readthedocs site has details, or run bedmap --help
for a shorter overview.
By changing the test in the awk
statement, this approach can be modified to return other subsets of the input's power set, e.g., all elements common to exactly, less than, or greater than N inputs.
Once you have the elements which meet your overlap threshold, you can then process those elements to get their overlapping regions via a final operation:
$ bedops --partition common.bed | bedmap --count --echo --delim '\t' - common.bed | awk '$1 >= 2' | cut -f2- > overlaps.bed
If your inputs are not internally disjoint — if some elements may overlap within any one of the N input files — you might instead apply some ID-based tricks I describe in my answer over here: A: Intersect multiple BED files
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi, thanks for the answer. I am a bit confused by when you say "internally disjoint". Are you implying that for any region in any bed file, it must not overlap with any other region in any other bed file? If so, I think this would not help me; I am looking to preserve the regions that overlap in at least 2 bed files.
Or by "internally disjoint" do you refer to any region within a single bam file? As in, no region within any given bam file can overlap with any other region in that same bam file?
What I mean is the case where any two regions within a single BED file overlap.
For instance, sequencing reads converted from BAM or SAM to BED will almost certainly overlap within the same file, because reads typically pile up that way.
Other types of annotations could also be problematic, in that they can overlap: genes, TF binding sites, DHSs.
If this is the case, then you need to do a little more work to add (unique) per-file identifiers to each BED file, so that when you find overlaps, you also test if they have different per-file identifiers.
Otherwise, without that step (or some similar approach) you would get overlapping regions that are internal to one file. This does not appear to be what you want. It sounds like you want overlaps only from at least two or more files.
I was looking at the bedmaps manual, and I read this line:
If I utilize more than 2 bed files A, B, C...etc., will this get all the common elements in bed files B, C, D... compared to A, or will it compare among all bed files (such as A, C, D, E... to B) as well?
The
bedops
tool allows more than two files (A, B, C, etc.).Conversely, the
bedmap
tool only takes two files as arguments (reference and map).However, if you're using
bedmap
, you can use a trick called a "process substitution" to replace thereference.bed
ormap.bed
files.A process substitution is a fancy term for building a BED file on the fly. We can use it as input to
bedmap
.Here's an example:
In this example, the former
map.bed
input gets replaced with what would come out ofbedops --everything A.bed B.bed ... N.bed
(the set union of files A, B, C, etc.).This is functionally equivalent to:
Any command you put into the
...
in<(...)
returns a file. You can specify this as input tobedmap
,bedops
, and other tools. You could do<(bedops --merge A.bed ... N.bed)
, or any other operations on any other files. It just depends on what kinds of elements you want to map.Also if you could explain the purpose of the second command, that would be helpful. I ran these commands but in the overlap.bed file, I see that the intervals are not contiguous and are only 1 base long. Is there a reason for this?
Can you post the command you're running and perhaps provide access to your input files (or a small sample snippet)? I can run that locally and help answer your question with more specifics.