Intersect multiple BED files
4
3
Entering edit mode
5.8 years ago
int11ap1 ▴ 440

I have a directory with 50 bed files. I want to get all regions in common in at least 50% of my files. How would I get that? Using bedopts I can intersect all, but I want at least those existing regions in 25 files.

bed intersect • 4.9k views
6
Entering edit mode
5.8 years ago

This isn't counting the fraction of overlap between intervals, so you can't directly use available options in bedops or bedmap, but you can use some ID tricks to uniquely associate an interval with the BED file it comes from, and use that ID information to do counting and thresholding.

Here is one way to do so:

1. Label your input BED files so that their IDs uniquely identify their intervals, e.g., for BED files called 1.bed, 2.bed, etc.:

$for idx in seq 1 50; do cut -f1-3$idx.bed | awk -vidx=$idx '{ print$0"\t"idx; }' > $idx.id.bed; done  You would modify this loop depending on how you name your files. Basically, however you do this, the goal is to print the interval in the first three columns, and use the number of files you have as a unique identifier in the fourth column. You would print 1 in the fourth column of the first file 1.id.bed, and 5 in the fourth column of the fifth file 5.id.bed, and so on. I use numbers here, but you can use any unique identifier you want, instead. Cell type. Experiment or sample name, etc. The key part to making this work is that the chosen identifier is unique to that input file and that file only. 2. Take the union of all these ID-tagged files with BEDOPS bedops --everything and pipe the result to BEDOPS bedmap with the --echo-map-id-uniq operation: $ bedops --everything *.id.bed | bedmap --echo --echo-map-id-uniq --delim '\t' - > all.bed


What you have at this point, for every interval from every input BED file, is a file called all.bed, which shows that interval in the first three columns, the file it comes from in the fourth column, and a fifth column that is a semi-colon delimited string of every unique ID value that overlaps that interval, e.g.:

...
chrN     123    456     5     2;5;6;7;11;12;...;46;48
chrN     234    567     5     5;6;7;8
...


Because we use --echo-map-id-uniq, this string contains unique identifiers - no duplicates. This uniqueness property is how we can count how many of the original input files overlap.

3. To get an interval that has regions in common with at least 50% of your 50 input files - 25 files other than the original file the interval comes from - you can use awk to split this ID string in the fifth column by the semi-colon delimiter,* *counting how many unique IDs you find and printing the interval, if the threshold of 26 ID values is met (25 IDs other than the ID representing the original interval's source):

$awk -vthreshold=26 '{ \ n_ids = split($5, ids, ";"); \
if (n_ids >= threshold) { \
print $0; \ } \ }' all.bed > thresholded.bed  A shorter, cleaner awk statement with the same functionality would be: $ awk -vthreshold=26 '(split($5, ids, ";") >= threshold)' all.bed > thresholded.bed  Bonus: If you need to get back any additional columns of the original interval, you can just do another bedmap operation on thresholded.bed on a union of the 50 input BED files. $ bedops --everything 1.bed 2.bed ... 50.bed > union.bed
$bedmap --echo --exact union.bed thresholded.bed > union-thresholded.bed  Note: Input files should be sorted before use with BEDOPS tools, if the sort state is unknown or different. BEDOPS includes a tool called sort-bed that sorts BED files faster than GNU sort or alternatives. $ sort-bed < foo.unsorted.bed > foo.sorted.bed


You can use whatever you like here, so long as the BEDOPS sort criteria are met before passing data to bedops and bedmap.

0
Entering edit mode

So useful, thanks a lot!!

0
Entering edit mode

You're welcome!

3
Entering edit mode
5.8 years ago

Have a look at this post: Bedtools Compare Multiple Bed Files?

2
Entering edit mode
5.8 years ago
stolarek.ir ▴ 670

Since this thread is active and I work with almost the same problem now here is my question:

How can I visualize those common intervals from many bed files?

I would imagine something like

file1) --------------       ------------------- ------
file2) ------    ----------    -----------   ----------
file3) ---------- ------------- --      ---    ---


Where - indicates present in file, and empty space, as no coverage at all

0
Entering edit mode

A general approach can be done with BEDOPS bedmap --count, which generalizes to N input files:

$N=ls *.bed | wc -l$ bedops --everything A.bed B.bed C.bed ... N.bed \
| bedmap --count --echo --delim '\t' - \
| uniq \
| awk -vN=${N} '$1==N' \
| cut -f2- \
> common.bed


By changing the test in the awk statement, this approach can be modified to return other subsets of the input's power set, e.g., all elements common to N-1 inputs, N-2 inputs, etc.

This approach is linear and doesn't create intermediate files, so it should be generally preferable to alternative approaches that make N^2 or more passes, if your time is valuable to you.

0
Entering edit mode

yea, I'm just asking of the last part: generating the plot. I did some googling, and maybe plotgff, and some iranges tricks to put those reduced ranges on one plot.

On a side note however, I resolved my common intervals problem with:
'convert2bed' from bamutils, then 'bedops intersect' and passed the same file twice to get reduced intervals in the file. Last step 'bedtools multiinter -header -empty -g genome file' on all those reduced files to find common intervals, how long they are and to which files exactly they belong

1
Entering edit mode
5.8 years ago
Amitm ★ 2.1k

hi,

Using (recent versions) bedtools intersect, give one BED as 1st arg. and the rest 49 as 2nd arg. And use the -c parameter. That will give you the count of overlaps across the files. Then you can use that to apply your threshold.

Ofcourse the above works assuming you dont get multiple overalps from a singel BED. So you can first create unique intervals of each BED and then do the above.