How to report genomic intervals essentailly overlapping in all files?
1
1
Entering edit mode
3.3 years ago

Consider the following example from bedtools intersect help page. Say, there are 4 bed files as follows

$ cat query.bed
chr1  1   20
chr1  40  45
chr1  70  90
chr1  105 120
chr2  1   20
chr2  40  45
chr2  70  90
chr2  105 120
chr3  1   20
chr3  40  45
chr3  70  90
chr3  105 120
chr3  150 200
chr4  10  20

--

$ cat d1.bed
chr1  5   25
chr1  65  75
chr1  95  100
chr2  5   25
chr2  65  75
chr2  95  100
chr3  5   25
chr3  65  75
chr3  95  100

--

$ cat d2.bed
chr1  40  50
chr1  110 125
chr2  40  50
chr2  110 125
chr3  40  50
chr3  110 125

--

$ cat d3.bed
chr1  85  115
chr2  85  115
chr3  85  115

If we use bedtools like this

$ bedtools intersect -a query.bed \
    -b d1.bed d2.bed d3.bed
chr1  5   20
chr1  40  45
chr1  70  75
chr1  85  90
chr1  110 120
chr1  105 115
chr2  5   20
chr2  40  45
chr2  70  75
chr2  85  90
chr2  110 120
chr2  105 115
chr3  5   20
chr3  40  45
chr3  70  75
chr3  85  90
chr3  110 120
chr3  105 115

It reports the interval for e.g. chr1 5 20 , however, I don't want that to be reported, because it is not present in d3.bed. I know that it is being reported because it is overlapping in query.bed and d1.bed. But my requirement is different. How can I tweak bedtools to get ONLY those records which are overlapping across all files.

bedtools bed bam • 1.3k views
ADD COMMENT
1
Entering edit mode
bedtools intersect -wa -a query.bed -b d1.bed | bedtools intersect -wa -a - -b d2.bed | bedtools intersect -wa -a - -b d3.bed
ADD REPLY
0
Entering edit mode

This works but I cannot keep on writing names of all bed files one by one when I have 100's of files!

ADD REPLY
0
Entering edit mode

I know right :) But you should have had in your post that you got more than 3 bed files

ADD REPLY
0
Entering edit mode

It was a toy example from bedtools page as I had mentioned. Thanks Bastien Hervé for a quick response. I highly appreciate that :D

ADD REPLY
0
Entering edit mode

I offered a solution that generalizes to many files but ended up deleting it. Sorry and good luck!

ADD REPLY
3
Entering edit mode
3.3 years ago
jean.elbers ★ 1.7k

Try the following

bedtools multiinter -i *.bed > test1
files="$(ls *.bed|wc -l)"
awk  -v OFS='\t' -v files=$files '$4==files' test1 |cut -f 1-3 > test2

What it does:

bedtools multiinter -i *.bed > test1 # use bedtools multi intersect to get statistics on intersections of all bed files
files="$(ls *.bed|wc -l)" # make a variable called files that is the number of bed files in the directory
awk  -v OFS='\t' -v files=$files '$4==files' test1 |cut -f 1-3 > test2 # get only regions that are found in all of the files (column 4 is num, which is the number of input files that have that region)

test2 is the file that contains regions found in all files assuming that the only bed files in the directory are query.bed, d1.bed, d2.bed, and d3.bed

ADD COMMENT
0
Entering edit mode

Hi jean.elbers

I tried this so far,

bioinfo$ for i in *.bed ; do echo $i ; echo ; cat $i ; echo ; done
test1.bed

chr1    20  30

test2.bed

chr1    15  25

test.bed

chr1    10  30

--

bioinfo$ bedtools multiinter -i test1.bed test2.bed  test.bed 
chr1    10  15  1   3   0   0   1
chr1    15  20  2   2,3 0   1   1
chr1    20  25  3   1,2,3   1   1   1
chr1    25  30  2   1,3 1   0   1

--

bioinfo$ bedtools multiinter -i test1.bed test2.bed  test.bed  > out

--

bioinfo$ cat out
chr1    10  15  1   3   0   0   1
chr1    15  20  2   2,3 0   1   1
chr1    20  25  3   1,2,3   1   1   1
chr1    25  30  2   1,3 1   0   1

bioinfo$ files="$(ls *.bed|wc -l)"

--

bioinfo$ echo $files
3

--

But below command generates nothing

bioinfo$ awk  -v OFS='\t' -v files=files '$4==files' out
ADD REPLY
1
Entering edit mode

Hmmm... does this work?

awk  -v OFS='\t'  '$4==3' out | cut -f 1-3
ADD REPLY
0
Entering edit mode

Works! But how can I modify such that 3 is replaced by the "count of file" ? This is because I want to automate that !!

Thanks

ADD REPLY
1
Entering edit mode

I fixed the command. I forgot $ before $files. See above for fixed version

ADD REPLY
0
Entering edit mode

Seems like you forgot to make the change

from

bioinfo$ awk  -v OFS='\t' -v files=files '$4==files' out

to

bioinfo$ awk  -v OFS='\t' -v files=$files '$4==files' out

correct ?

ADD REPLY
1
Entering edit mode

Yes. It is hard to do on the phone right now. Sorry.

ADD REPLY

Login before adding your answer.

Traffic: 1656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6