Question

Stats on the overlapping length within two bed files

0

Entering edit mode

7.4 years ago

beanavarro85 ▴ 10

Hi all,

I have two bed files, and I have run bedtools intersect to calculate the overlapping regions.

bedtools intersect -a fileA.bed -b fileB.bed -wao  > overlaps.bed

I need to calculate the number of overlaps that are longer than 100 bp (i.e. value higher than 100 in the 8th column of the overlaps.bed file, if I am right) per each feature in fileA.bed, as well as the total length covered by those overlaps.

Any ideas?

Thanks

bedtools bedops genome • 4.0k views

ADD COMMENT • link updated 7.4 years ago by Alex Reynolds 36k • written 7.4 years ago by beanavarro85 ▴ 10

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

ADD REPLY • link 7.4 years ago by Ram 45k

0

Entering edit mode

Thanks! I will next time, sorry about that.

ADD REPLY • link 7.4 years ago by beanavarro85 ▴ 10

score 1 · Answer 1 · 2018-03-02

1

Entering edit mode

7.4 years ago

Alex Reynolds 36k

Via BEDOPS and awk:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }'

The bedmap command takes all overlaps between map.bed and an element in reference.map, and applies the --echo-overlap-size operator, which prints the sizes of all overlapping elements.

The awk command filters overlap sizes by lengths greater than 100, sums their sizes and prints the sum and the number of filtered elements.

ADD COMMENT • link 7.4 years ago by Alex Reynolds 36k

0

Entering edit mode

Awesome!

Thanks, it worked perfectly.

Since the order won't change, I have pasted this two columns to my reference.bed file like this:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' > tmp
$ paste reference.bed tmp

ADD REPLY • link 7.4 years ago by beanavarro85 ▴ 10

0

Entering edit mode

Cool. Here's a modified version that takes out a step of making the intermediate file tmp, which would help speed it up some more and eliminate generating tmp:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' - | paste reference.bed - > answer.bed

The hyphen - is a common placeholder for standard input from upstream processes. This placeholder can be used with BEDOPS tools like bedops, bedmap etc. as well as with common Unix utilities, like awk, paste, gunzip, etc.

Piping input from one process to the next in this way is a good way to skip making intermediate files, which can take time to generate and unnecessarily use up disk space.

ADD REPLY • link 7.4 years ago by Alex Reynolds 36k

0

Entering edit mode

I was sure there was a way to do it in one line, but didn't know how!

ADD REPLY • link 7.4 years ago by beanavarro85 ▴ 10

score 0 · Answer 2 · 2018-03-02

Here's another way that's a little simpler, perhaps:

$ bedmap --bp-ovr 100 --count --echo-overlap-size --delim "\t" reference.bed map.bed | awk '{ s=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if(a[i]>=100) { s+=a[i]; } print s"\t"$1; }'

The --bp-ovr 100 operator in bedmap requires 100 bases of overlap between elements in map.bed and the element in reference.bed. The --echo-overlap-size operator behaves as described in my previous answer.

The resulting overlap sizes are then passed to awk, which prints the count of qualifying overlaps and the sum of their sizes.