Get intersection of files occuring in at least 2 files?
1
0
Entering edit mode
4.1 years ago
vctrm67 ▴ 50

Is there an easy way to use bedtools (or any other software) to get the features of the intersection of multiple bed files, for features/intervals that occur in at least 2 bed files?

bedtools • 1.3k views
ADD COMMENT
1
Entering edit mode
4.1 years ago

Here is a general approach with BEDOPS bedmap --count, which generalizes to N internally-disjoint (and sort-bed-sorted) input files:

$ bedops --everything 1.bed 2.bed 3.bed ... N.bed \
    | bedmap --count --echo --delim '\t' - \
    | uniq \
    | awk -v OVERLAPS=2 '$1 >= OVERLAPS' \
    | cut -f2- \
    > common.bed

The bedmap step uses, by default, overlap of one or more bases for inclusion. You can modify this threshold to be more stringent. The readthedocs site has details, or run bedmap --help for a shorter overview.

By changing the test in the awk statement, this approach can be modified to return other subsets of the input's power set, e.g., all elements common to exactly, less than, or greater than N inputs.

Once you have the elements which meet your overlap threshold, you can then process those elements to get their overlapping regions via a final operation:

$ bedops --partition common.bed | bedmap --count --echo --delim '\t' - common.bed | awk '$1 >= 2' | cut -f2- > overlaps.bed

If your inputs are not internally disjoint — if some elements may overlap within any one of the N input files — you might instead apply some ID-based tricks I describe in my answer over here: A: Intersect multiple BED files

ADD COMMENT
0
Entering edit mode

Hi, thanks for the answer. I am a bit confused by when you say "internally disjoint". Are you implying that for any region in any bed file, it must not overlap with any other region in any other bed file? If so, I think this would not help me; I am looking to preserve the regions that overlap in at least 2 bed files.

Or by "internally disjoint" do you refer to any region within a single bam file? As in, no region within any given bam file can overlap with any other region in that same bam file?

ADD REPLY
0
Entering edit mode

What I mean is the case where any two regions within a single BED file overlap.

For instance, sequencing reads converted from BAM or SAM to BED will almost certainly overlap within the same file, because reads typically pile up that way.

Other types of annotations could also be problematic, in that they can overlap: genes, TF binding sites, DHSs.

If this is the case, then you need to do a little more work to add (unique) per-file identifiers to each BED file, so that when you find overlaps, you also test if they have different per-file identifiers.

Otherwise, without that step (or some similar approach) you would get overlapping regions that are internal to one file. This does not appear to be what you want. It sounds like you want overlaps only from at least two or more files.

ADD REPLY
0
Entering edit mode

I was looking at the bedmaps manual, and I read this line:

Looking back at the Map and Reference datasets, let’s say we want to count the number of elements in Map that overlap a given Reference element, as well as the extent of that overlap as measured by the total number of overlapping bases from mapped elements. For this, we use the --count and --bases flags, respectively:

$ bedmap --echo --count --bases reference.bed map.bed
chr21   33031200    33032400    ref-1|61|1200
chr21   33031400    33031800    ref-2|21|400
chr21   33031900    33032000    ref-3|6|100
This result tells us that there are 61 elements in Map that overlap ref-1, and 1200 total bases from the 61 elements overlap bases of ref-1. Similarly, 21 elements overlap ref-2, and 400 total bases from the 21 elements overlap bases of ref-2, etc.

If I utilize more than 2 bed files A, B, C...etc., will this get all the common elements in bed files B, C, D... compared to A, or will it compare among all bed files (such as A, C, D, E... to B) as well?

ADD REPLY
0
Entering edit mode

The bedops tool allows more than two files (A, B, C, etc.).

Conversely, the bedmap tool only takes two files as arguments (reference and map).

However, if you're using bedmap, you can use a trick called a "process substitution" to replace the reference.bed or map.bed files.

A process substitution is a fancy term for building a BED file on the fly. We can use it as input to bedmap.

Here's an example:

$ bedmap --echo --count --bases reference.bed <(bedops --everything A.bed B.bed ... N.bed)

In this example, the former map.bed input gets replaced with what would come out of bedops --everything A.bed B.bed ... N.bed (the set union of files A, B, C, etc.).

This is functionally equivalent to:

$ bedops --everything A.bed ... N.bed | bedmap --echo --count --bases reference.bed -

Any command you put into the ... in <(...) returns a file. You can specify this as input to bedmap, bedops, and other tools. You could do <(bedops --merge A.bed ... N.bed), or any other operations on any other files. It just depends on what kinds of elements you want to map.

ADD REPLY
0
Entering edit mode

Also if you could explain the purpose of the second command, that would be helpful. I ran these commands but in the overlap.bed file, I see that the intervals are not contiguous and are only 1 base long. Is there a reason for this?

ADD REPLY
0
Entering edit mode

Can you post the command you're running and perhaps provide access to your input files (or a small sample snippet)? I can run that locally and help answer your question with more specifics.

ADD REPLY

Login before adding your answer.

Traffic: 2748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6