Question: Stats on the overlapping length within two bed files
0
gravatar for beanavarro85
12 months ago by
European Union
beanavarro850 wrote:

Hi all,

I have two bed files, and I have run bedtools intersect to calculate the overlapping regions.

bedtools intersect -a fileA.bed -b fileB.bed -wao  > overlaps.bed

I need to calculate the number of overlaps that are longer than 100 bp (i.e. value higher than 100 in the 8th column of the overlaps.bed file, if I am right) per each feature in fileA.bed, as well as the total length covered by those overlaps.

Any ideas?

Thanks

bedops genome bedtools • 647 views
ADD COMMENTlink modified 12 months ago by Alex Reynolds27k • written 12 months ago by beanavarro850

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Formatting bar

ADD REPLYlink written 12 months ago by RamRS20k

Thanks! I will next time, sorry about that.

ADD REPLYlink written 12 months ago by beanavarro850
1
gravatar for Alex Reynolds
12 months ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

Via BEDOPS and awk:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }'

The bedmap command takes all overlaps between map.bed and an element in reference.map, and applies the --echo-overlap-size operator, which prints the sizes of all overlapping elements.

The awk command filters overlap sizes by lengths greater than 100, sums their sizes and prints the sum and the number of filtered elements.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Alex Reynolds27k

Awesome!

Thanks, it worked perfectly.

Since the order won't change, I have pasted this two columns to my reference.bed file like this:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' > tmp
$ paste reference.bed tmp
ADD REPLYlink written 12 months ago by beanavarro850

Cool. Here's a modified version that takes out a step of making the intermediate file tmp, which would help speed it up some more and eliminate generating tmp:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' - | paste reference.bed - > answer.bed

The hyphen - is a common placeholder for standard input from upstream processes. This placeholder can be used with BEDOPS tools like bedops, bedmap etc. as well as with common Unix utilities, like awk, paste, gunzip, etc.

Piping input from one process to the next in this way is a good way to skip making intermediate files, which can take time to generate and unnecessarily use up disk space.

ADD REPLYlink written 12 months ago by Alex Reynolds27k

I was sure there was a way to do it in one line, but didn't know how!

ADD REPLYlink written 12 months ago by beanavarro850
0
gravatar for Alex Reynolds
12 months ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

Here's another way that's a little simpler, perhaps:

$ bedmap --bp-ovr 100 --count --echo-overlap-size --delim "\t" reference.bed map.bed | awk '{ s=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if(a[i]>=100) { s+=a[i]; } print s"\t"$1; }'

The --bp-ovr 100 operator in bedmap requires 100 bases of overlap between elements in map.bed and the element in reference.bed. The --echo-overlap-size operator behaves as described in my previous answer.

The resulting overlap sizes are then passed to awk, which prints the count of qualifying overlaps and the sum of their sizes.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Alex Reynolds27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2568 users visited in the last hour