Edit: 10.12.18 => Modified the code snippet to work with an arbitrary number of BED files:
This is the code to work with two files. It does the intersection outputting the entire input intervals of -a
and -b
in case of overlap, and then use awk
to select the smaller of the two start coordinates and the larger of the two end coordinates. The awk
part can for sure be written more elegantly.
bedtools intersect -a sample1.bed -b sample2.bed -wa -wb | \
awk 'OFS="\t" {if ($2 < $5) print $1, $2, $3, $6; else print $1, $5, $3, $6}' | \
awk 'OFS="\t" {if ($3 > $4) print $1, $2, $3; else print $1, $2, $4}'
To work with an arbitrary number of files, you basically do the same thing. First, you intersect the first two files with the above code and store the result to disk. Then you loop through all other BED files and apply the same code to intersect the intermediate result with the current BED file, storing the result again to disk. Repeat until all BED files have been processed. For this, all files in the current directory that are to be processed hve the .bed
suffix:
#!/bin/bash
## Code as in my first answer, just wrapped into a function:
function MyIntersect {
bedtools intersect -a $1 -b $2 -wa -wb | \
awk 'OFS="\t" {if ($2 < $5) print $1, $2, $3, $6; else print $1, $5, $3, $6}' | \
awk 'OFS="\t" {if ($3 > $4) print $1, $2, $3; else print $1, $2, $4}'
}; export -f MyIntersect
## Write names of all BED files to a file:
ls *.bed > all_beds.txt
## Get total number of BED files in current directory:
TOTALNUMBER="$(cat all_beds.txt | wc -l | sed -e 's/^[ \t]*//')"
## Start with the first two files manually (head -n1 gets the first file, head -n2 | tail -n 1 the second one from all_beds.txt):
MyIntersect "$(head -n 1 all_beds.txt)" "$(head -n 2 all_beds.txt | tail -n 1)" > intermediate.txt
## and now do the same thing with this intermediate result from the command above until all other BED files have been used:
for i in `seq 3 $TOTALNUMBER`
do
MyIntersect intermediate.txt "$(head -n $i all_beds.txt | tail -n 1)" > intermediate_new.txt
mv intermediate_new.txt intermediate.txt
done
## Output is final.bed
mv intermediate.txt final.bed
Thank you very much! will this also work if I have more than 2 .bed files?
No, because as can be seen from the awk command, this is tailored for a pairwise comparison of two files. If you have multiple files, one would probably need to do something with
bedtools multiinter
in a two-step process, so first usingmultiinter
to identify regions that overlap in at least a certain number or all samples and then use the common interval to do another round of intersect against all files to get the flanks. Can you provide some more details on how many files you have, and what the criteria are?For This current project I have in total 3 different files but for other projects it is possible that I have more than five repeats .. So I need a solution that works for a varying amount of files ...
I edited my answer to make it work for an arbitrary number of BED files in the current directory.