Question: Intersect multiple BED files
2
gravatar for int11ap1
3.7 years ago by
int11ap1360
Barcelona
int11ap1360 wrote:

I have a directory with 50 bed files. I want to get all regions in common in at least 50% of my files. How would I get that? Using `bedopts` I can intersect all, but I want at least those existing regions in 25 files.

intersect bed • 2.9k views
ADD COMMENTlink modified 3.7 years ago by Aaronquinlan11k • written 3.7 years ago by int11ap1360
6
gravatar for Alex Reynolds
3.7 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

This isn't counting the fraction of overlap between intervals, so you can't directly use available options in bedops or bedmap, but you can use some ID tricks to uniquely associate an interval with the BED file it comes from, and use that ID information to do counting and thresholding.

Here is one way to do so:

1. Label your input BED files so that their IDs uniquely identify their intervals, e.g., for BED files called 1.bed, 2.bed, etc.:

$ for idx in `seq 1 50`; do cut -f1-3 $idx.bed | awk -vidx=$idx '{ print $0"\t"idx; }' > $idx.id.bed; done

You would modify this loop depending on how you name your files. Basically, however you do this, the goal is to print the interval in the first three columns, and use the number of files you have as a unique identifier in the fourth column. You would print 1 in the fourth column of the first file 1.id.bed, and 5 in the fourth column of the fifth file 5.id.bed, and so on.

I use numbers here, but you can use any unique identifier you want, instead. Cell type. Experiment or sample name, etc. The key part to making this work is that the chosen identifier is unique to that input file and that file only.

2. Take the union of all these ID-tagged files with BEDOPS bedops --everything and pipe the result to BEDOPS bedmap with the --echo-map-id-uniq operation:

$ bedops --everything *.id.bed | bedmap --echo --echo-map-id-uniq --delim '\t' - > all.bed

What you have at this point, for every interval from every input BED file, is a file called all.bed, which shows that interval in the first three columns, the file it comes from in the fourth column, and a fifth column that is a semi-colon delimited string of every unique ID value that overlaps that interval, e.g.:

...
chrN     123    456     5     2;5;6;7;11;12;...;46;48
chrN     234    567     5     5;6;7;8
...

Because we use --echo-map-id-uniq, this string contains unique identifiers — no duplicates. This uniqueness property is how we can count how many of the original input files overlap.

3. To get an interval that has regions in common with at least 50% of your 50 input files — 25 files other than the original file the interval comes from — you can use awk to split this ID string in the fifth column by the semi-colon delimiter, counting how many unique IDs you find and printing the interval, if the threshold of 26 ID values is met (25 IDs other than the ID representing the original interval's source):

$ awk -vthreshold=26 '{ \
    n_ids = split($5, ids, ";"); \
    if (n_ids >= threshold) { \
      print $0; \
    } \
  }' all.bed > thresholded.bed

A shorter, cleaner awk statement with the same functionality would be:

$ awk -vthreshold=26 '(split($5, ids, ";") >= threshold)' all.bed > thresholded.bed

Bonus: If you need to get back any additional columns of the original interval, you can just do another bedmap operation on thresholded.bed on a union of the 50 input BED files.

$ bedops --everything 1.bed 2.bed ... 50.bed > union.bed
$ bedmap --echo --exact union.bed thresholded.bed > union-thresholded.bed

Note: Input files should be sorted before use with BEDOPS tools, if the sort state is unknown or different. BEDOPS includes a tool called sort-bed that sorts BED files faster than GNU sort or alternatives.

$ sort-bed < foo.unsorted.bed > foo.sorted.bed

You can use whatever you like here, so long as the BEDOPS sort criteria are met before passing data to bedops and bedmap.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds28k

So useful, thanks a lot!!

ADD REPLYlink written 3.7 years ago by int11ap1360

You're welcome!

ADD REPLYlink written 3.7 years ago by Alex Reynolds28k
2
gravatar for Aaronquinlan
3.7 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

Have a look at this post: Bedtools Compare Multiple Bed Files?

ADD COMMENTlink written 3.7 years ago by Aaronquinlan11k
1
gravatar for Amitm
3.7 years ago by
Amitm1.6k
UK
Amitm1.6k wrote:

hi,

Using (recent versions) bedtools intersect, give one BED as 1st arg. and the rest 49 as 2nd arg. And use the -c parameter. That will give you the count of overlaps across the files. Then you can use that to apply your threshold.

Ofcourse the above works assuming you dont get multiple overalps from a singel BED. So you can first create unique intervals of each BED and then do the above. 

ADD COMMENTlink written 3.7 years ago by Amitm1.6k
0
gravatar for stolarek.ir
3.7 years ago by
stolarek.ir600
Poland
stolarek.ir600 wrote:

Since this thread is active and I work with almost the same problem now here is my question:

How can I visualize those common intervals from many bed files?

I would imagine something like

file1) --------------       ------------------- ------

file2) ------    ----------    -----------   ----------

file3) ---------- ------------- --      ---    ---

Where - indicates present in file, and empty space, as no coverage at all

ADD COMMENTlink written 3.7 years ago by stolarek.ir600

A general approach can be done with BEDOPS bedmap --count, which generalizes to N input files:

$ N=`ls *.bed | wc -l`
$ bedops --everything A.bed B.bed C.bed ... N.bed \
    | bedmap --count --echo --delim '\t' - \
    | uniq \
    | awk -vN=${N} '$1==N' \
    | cut -f2- \
    > common.bed

By changing the test in the awk statement, this approach can be modified to return other subsets of the input's power set, e.g., all elements common to N-1 inputs, N-2 inputs, etc.

This approach is linear and doesn't create intermediate files, so it should be generally preferable to alternative approaches that make N^2 or more passes, if your time is valuable to you.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds28k

yea, I'm just asking of the last part: generating the plot. I did some googling, and maybe plotgff, and some iranges tricks to put those reduced ranges on one plot.

On a side note however, I resolved my common intervals problem with:
'convert2bed' from bamutils, then 'bedops intersect' and passed the same file twice to get reduced intervals in the file. Last step 'bedtools multiinter -header -empty -g genome file' on all those reduced files to find common intervals, how long they are and to which files exactly they belong

ADD REPLYlink written 3.7 years ago by stolarek.ir600
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 743 users visited in the last hour