Question: Stats on the overlapping length within two bed files
0
gravatar for beanavarro85
17 months ago by
European Union
beanavarro850 wrote:

Hi all,

I have two bed files, and I have run bedtools intersect to calculate the overlapping regions.

bedtools intersect -a fileA.bed -b fileB.bed -wao  > overlaps.bed

I need to calculate the number of overlaps that are longer than 100 bp (i.e. value higher than 100 in the 8th column of the overlaps.bed file, if I am right) per each feature in fileA.bed, as well as the total length covered by those overlaps.

Any ideas?

Thanks

bedops genome bedtools • 833 views
ADD COMMENTlink modified 17 months ago by Alex Reynolds28k • written 17 months ago by beanavarro850

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Formatting bar

ADD REPLYlink written 17 months ago by RamRS23k

Thanks! I will next time, sorry about that.

ADD REPLYlink written 17 months ago by beanavarro850
1
gravatar for Alex Reynolds
17 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Via BEDOPS and awk:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }'

The bedmap command takes all overlaps between map.bed and an element in reference.map, and applies the --echo-overlap-size operator, which prints the sizes of all overlapping elements.

The awk command filters overlap sizes by lengths greater than 100, sums their sizes and prints the sum and the number of filtered elements.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Alex Reynolds28k

Awesome!

Thanks, it worked perfectly.

Since the order won't change, I have pasted this two columns to my reference.bed file like this:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' > tmp
$ paste reference.bed tmp
ADD REPLYlink written 17 months ago by beanavarro850

Cool. Here's a modified version that takes out a step of making the intermediate file tmp, which would help speed it up some more and eliminate generating tmp:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' - | paste reference.bed - > answer.bed

The hyphen - is a common placeholder for standard input from upstream processes. This placeholder can be used with BEDOPS tools like bedops, bedmap etc. as well as with common Unix utilities, like awk, paste, gunzip, etc.

Piping input from one process to the next in this way is a good way to skip making intermediate files, which can take time to generate and unnecessarily use up disk space.

ADD REPLYlink written 17 months ago by Alex Reynolds28k

I was sure there was a way to do it in one line, but didn't know how!

ADD REPLYlink written 17 months ago by beanavarro850
0
gravatar for Alex Reynolds
17 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Here's another way that's a little simpler, perhaps:

$ bedmap --bp-ovr 100 --count --echo-overlap-size --delim "\t" reference.bed map.bed | awk '{ s=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if(a[i]>=100) { s+=a[i]; } print s"\t"$1; }'

The --bp-ovr 100 operator in bedmap requires 100 bases of overlap between elements in map.bed and the element in reference.bed. The --echo-overlap-size operator behaves as described in my previous answer.

The resulting overlap sizes are then passed to awk, which prints the count of qualifying overlaps and the sum of their sizes.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1604 users visited in the last hour