Question: How to report genomic intervals essentailly overlapping in all files?
1
gravatar for Vijay Lakhujani
6 weeks ago by
Vijay Lakhujani4.0k
India
Vijay Lakhujani4.0k wrote:

Consider the following example from bedtools intersect help page. Say, there are 4 bed files as follows

$ cat query.bed
chr1  1   20
chr1  40  45
chr1  70  90
chr1  105 120
chr2  1   20
chr2  40  45
chr2  70  90
chr2  105 120
chr3  1   20
chr3  40  45
chr3  70  90
chr3  105 120
chr3  150 200
chr4  10  20

--

$ cat d1.bed
chr1  5   25
chr1  65  75
chr1  95  100
chr2  5   25
chr2  65  75
chr2  95  100
chr3  5   25
chr3  65  75
chr3  95  100

--

$ cat d2.bed
chr1  40  50
chr1  110 125
chr2  40  50
chr2  110 125
chr3  40  50
chr3  110 125

--

$ cat d3.bed
chr1  85  115
chr2  85  115
chr3  85  115

If we use bedtools like this

$ bedtools intersect -a query.bed \
    -b d1.bed d2.bed d3.bed
chr1  5   20
chr1  40  45
chr1  70  75
chr1  85  90
chr1  110 120
chr1  105 115
chr2  5   20
chr2  40  45
chr2  70  75
chr2  85  90
chr2  110 120
chr2  105 115
chr3  5   20
chr3  40  45
chr3  70  75
chr3  85  90
chr3  110 120
chr3  105 115

It reports the interval for e.g. chr1 5 20 , however, I don't want that to be reported, because it is not present in d3.bed. I know that it is being reported because it is overlapping in query.bed and d1.bed. But my requirement is different. How can I tweak bedtools to get ONLY those records which are overlapping across all files.

bam bed bedtools • 256 views
ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by Vijay Lakhujani4.0k
1
bedtools intersect -wa -a query.bed -b d1.bed | bedtools intersect -wa -a - -b d2.bed | bedtools intersect -wa -a - -b d3.bed
ADD REPLYlink written 6 weeks ago by Bastien Hervé4.0k

This works but I cannot keep on writing names of all bed files one by one when I have 100's of files!

ADD REPLYlink written 6 weeks ago by Vijay Lakhujani4.0k

I know right :) But you should have had in your post that you got more than 3 bed files

ADD REPLYlink written 6 weeks ago by Bastien Hervé4.0k

It was a toy example from bedtools page as I had mentioned. Thanks Bastien Hervé for a quick response. I highly appreciate that :D

ADD REPLYlink written 5 weeks ago by Vijay Lakhujani4.0k

I offered a solution that generalizes to many files but ended up deleting it. Sorry and good luck!

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Alex Reynolds28k
2
gravatar for jean.elbers
6 weeks ago by
jean.elbers970
jean.elbers970 wrote:

Try the following

bedtools multiinter -i *.bed > test1
files="$(ls *.bed|wc -l)"
awk  -v OFS='\t' -v files=$files '$4==files' test1 |cut -f 1-3 > test2

What it does:

bedtools multiinter -i *.bed > test1 # use bedtools multi intersect to get statistics on intersections of all bed files
files="$(ls *.bed|wc -l)" # make a variable called files that is the number of bed files in the directory
awk  -v OFS='\t' -v files=$files '$4==files' test1 |cut -f 1-3 > test2 # get only regions that are found in all of the files (column 4 is num, which is the number of input files that have that region)

test2 is the file that contains regions found in all files assuming that the only bed files in the directory are query.bed, d1.bed, d2.bed, and d3.bed

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by jean.elbers970

Hi jean.elbers

I tried this so far,

bioinfo$ for i in *.bed ; do echo $i ; echo ; cat $i ; echo ; done
test1.bed

chr1    20  30

test2.bed

chr1    15  25

test.bed

chr1    10  30

--

bioinfo$ bedtools multiinter -i test1.bed test2.bed  test.bed 
chr1    10  15  1   3   0   0   1
chr1    15  20  2   2,3 0   1   1
chr1    20  25  3   1,2,3   1   1   1
chr1    25  30  2   1,3 1   0   1

--

bioinfo$ bedtools multiinter -i test1.bed test2.bed  test.bed  > out

--

bioinfo$ cat out
chr1    10  15  1   3   0   0   1
chr1    15  20  2   2,3 0   1   1
chr1    20  25  3   1,2,3   1   1   1
chr1    25  30  2   1,3 1   0   1

bioinfo$ files="$(ls *.bed|wc -l)"

--

bioinfo$ echo $files
3

--

But below command generates nothing

bioinfo$ awk  -v OFS='\t' -v files=files '$4==files' out
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Vijay Lakhujani4.0k
1

Hmmm... does this work?

awk  -v OFS='\t'  '$4==3' out | cut -f 1-3
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by jean.elbers970

Works! But how can I modify such that 3 is replaced by the "count of file" ? This is because I want to automate that !!

Thanks

ADD REPLYlink written 6 weeks ago by Vijay Lakhujani4.0k
1

I fixed the command. I forgot $ before $files. See above for fixed version

ADD REPLYlink written 6 weeks ago by jean.elbers970

Seems like you forgot to make the change

from

bioinfo$ awk  -v OFS='\t' -v files=files '$4==files' out

to

bioinfo$ awk  -v OFS='\t' -v files=$files '$4==files' out

correct ?

ADD REPLYlink written 6 weeks ago by Vijay Lakhujani4.0k
1

Yes. It is hard to do on the phone right now. Sorry.

ADD REPLYlink written 6 weeks ago by jean.elbers970
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1035 users visited in the last hour