Intersect multiple bed and keep all fields
4
1
Entering edit mode
8.7 years ago
user230613 ▴ 360

Hi there biostars,
I'm trying to intersect three bed files in order to get common features between the three of them, two of them, and also features specific per each bed. I've tried with multiIntersectBed (bedtools) and bedops also, but, the problem is that I want to keep all the fields of my input files for the overlapping. I mean, I'm looking for something like -wao in bedtools intersect but with multiple bed files.

Any suggestion will be appreciated.

bed bedtools bedops • 9.6k views
ADD COMMENT
0
Entering edit mode

Hi, could you do this?

ADD REPLY
3
Entering edit mode
8.7 years ago

Assuming the inputs are disjoint, here is a modification of a general BEDOPS-based solution from Shane Neph for N input files (for you, three inputs, but this works for the general case of N inputs):

$ bedops --everything file1.bed file2.bed ... fileN.bed \
    | bedmap --echo-map - \
    | awk '(split($0, a, ";") == 3)' - \
    | sed 's/\;/\n/g' - \
    | sort-bed - \
    | uniq - \
    > answer.bed

The file answer.bed contains a unique list of elements from N number of BED files that overlap between three elements by at least one or more bases.

Let's walk through this line by line.

  1. Take the multiset union of all elements from all N input files with bedops --everything. Note that this is not the same as a --merge operation, which instead makes contiguous regions from all overlapping input regions. Pipe this unioned set to bedmap.
  2. For each element from all files, use bedmap --echo-map to print any other elements from the multiset union that overlap that first element by one or more bases, using the default overlap criteria. The result will be a semi-colon delimited string that contains one or more overlapping elements. Note that this result will always include the original element, because it overlaps itself by 100%. Pipe this result to the awk statement.
  3. The awk command splits the string on the semi-colon delimiter and prints out the string if there are exactly three overlapping elements. We do this filter step, because we are interested in any one element that overlaps two other elements by the stated overlap criterion. We pipe this string to sed.
  4. The sed statement replaces all semi-colons with newline characters. This creates BED-formatted output that we pipe to sort-bed.
  5. The sort-bed tool takes the overlapping elements and puts them in lexicographically-sorted order. We pipe this to uniq to remove duplicate elements.

The file answer.bed is a sorted BED file that contains all unique elements from file1.bed through fileN.bed, where there is mutual, minimally one-base overlap between three sets.

If you look at that awk statement, you'll see how to modify that test for your other two use cases, such as when you want elements that overlap no other elements in any other sets ("one"), or the case where you want elements that overlap with only one other set ("two").

You might also find this page useful, which describes this more general approach to this problem and why it is scalable with BEDOPS.

ADD COMMENT
0
Entering edit mode

A.bed

chr1    817380  817560  +
chr1    827560  827740  +
chr1    867637  867817  +
chr1    869928  870108  +
chr1    904494  904674  +
chr1    906813  906993  +
chr1    912814  912994  +
chr1    921200  921380  +
chr1    923743  923923  +
chr1    924685  924865  +

B.bed

chr1    826994  827076  +
chr1    827606  827688  +
chr1    858085  858167  +
chr1    869888  869970  +
chr1    876620  876702  +
chr1    897446  897528  +
chr1    904661  904743  +
chr1    906858  906940  +
chr1    912971  913053  +
chr1    921210  921292  +

So as an example I have given 2 bed file so I want to merge multiple bedfiles and I want to keep the strand information when I do bedops merge but the strand information is lost. How do I retain the strand information

ADD REPLY
1
Entering edit mode
8.7 years ago

I think by default, multiIntersectBed handles that. For e.g:

# a.bed
chr1  6   12
chr1  10  20
chr1  22  27
chr1  24  30

# b.bed
chr1  12  32
chr1  14  30

# c.bed
chr1  8   15
chr1  10  14
chr1  32  34
$ multiIntersectBed -i a.bed b.bed c.bed 
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1

Here the 4th column says in how many files the region is present and 5th column says what are those files. Here the region chr1:12-15 present in three files, chr1:15-20 present in two files and chr1:30-32 andchr:32-34 present in any one of the files. This satisfies your statement "common features between the three of them, two of them, and also features specific per each bed."

Are you looking for something more?

ADD COMMENT
0
Entering edit mode

Yes, I've already tried this one. But, the thing is that I've more fields (columns) with another information in the input bed files that I'd like to keep after intersecting. Sorry if I didn't explain myself in a clear way. And thanks!

ADD REPLY
0
Entering edit mode

Can you just show small example (3-5 lines) of how the data looks and what is the expected output?

ADD REPLY
1
Entering edit mode
8.7 years ago
Katie D'Aco ★ 1.1k

Would piping output from intersect to another intersect give you what you want?

bedtools intersect -wao -a a.bed -b b.bed | bedtools intersect -wao -a - -b c.bed

chr1    10      20      a1      1       +       .       -1      -1      .       -1      .       0       chr1    0  100      c1      1       +       10
chr1    100     200     a2      2       -       chr1    90      101     b2      2       -       1       .       -1 -1       .       -1      .       0
chr1    100     200     a2      2       -       chr1    100     110     b3      3       +       10      .       -1 -1       .       -1      .       0
ADD COMMENT
0
Entering edit mode
6.5 years ago

This does not seem to work for me. It collapses some intervals together.

ADD COMMENT
0
Entering edit mode

The above suggestions are fine in case you have edited your files in windows or in excel before using bedtools just run dos2unix and use bedtools to intersect then it will not collapse

ADD REPLY

Login before adding your answer.

Traffic: 1521 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6