I'm looking for a good way to take a set of intervals and output a new interval set (BED file) containing the regions immediately upstream and downstream of every interval in the original file, let's say 10 kb on each side. Any help appreciated, thanks!
As with most interval operations, bedtools has a command for it:
https://bedtools.readthedocs.io/en/latest/content/tools/flank.html
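For reference, a minimal sketch of how bedtools flank might be called for the 10 kb case. The file names input.bed, genome.txt and flanks.bed and the toy coordinates are made up for illustration; genome.txt is the tab-separated chromosome-name/length file that the tool takes through -g. The call is guarded so the snippet does nothing if bedtools is not on the PATH.

```shell
# Toy inputs; file names and coordinates are hypothetical.
printf 'chr1\t20000\t30000\n' > input.bed
printf 'chr1\t1000000\n' > genome.txt   # chromosome name, chromosome length

if command -v bedtools >/dev/null 2>&1; then
    # -b adds the same flank size on both sides;
    # -l and -r can be used instead for different left/right sizes.
    bedtools flank -i input.bed -g genome.txt -b 10000 > flanks.bed
    cat flanks.bed
else
    echo "bedtools not found on PATH" >&2
fi
```

With these toy inputs the result should be two records, chr1:10000-20000 and chr1:30000-40000. Flanks are clipped at the chromosome boundaries given in the genome file.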
awk -F '\t' '{X=10000; B=int($2);E=int($3);printf("%s\t%d\t%d\n%s\t%d\t%d\n",$1,B-X<0?0:B-X,B,$1,E,E+X);}'
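The awk one-liner above clips the upstream flank at 0 but lets the downstream flank run past the chromosome end. Here is a rough sketch of a variant that also clips downstream, assuming you have a two-column chromosome sizes file (called genome.txt here; that name, input.bed, flanks.bed and the toy coordinates are all made up for illustration).

```shell
# Toy inputs; file names and coordinates are hypothetical.
printf 'chr1\t5000\t20000\nchr1\t95000\t98000\n' > input.bed
printf 'chr1\t100000\n' > genome.txt   # chromosome name, chromosome length

awk -F '\t' -v OFS='\t' -v X=10000 '
    # First file (genome.txt): remember each chromosome length.
    NR == FNR { len[$1] = $2 + 0; next }
    # Second file (input.bed): emit the upstream flank, then the downstream
    # flank, clipped to the range 0..chromosome length.
    {
        B = int($2); E = int($3)
        up = (B - X < 0) ? 0 : B - X
        down = E + X
        if (($1 in len) && down > len[$1]) down = len[$1]
        if (up < B)   print $1, up, B      # skip zero-length flanks
        if (E < down) print $1, E, down
    }
' genome.txt input.bed > flanks.bed

cat flanks.bed
```

Chromosomes missing from genome.txt are left unclipped on the downstream side, and strand is ignored, same as in the one-liner.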
If you are using R, you can do it with the flank function in GenomicRanges. It takes into account the chromosome lengths, if present.
Yep, just use BEDOPS bedmap --range to map padded elements:
$ bedmap --skip-unmapped --echo-map --range 10000 reference.bed map.bed | awk '(!a[$0]++)' | sort-bed - > answer.bed
We use awk to strip duplicates from unsorted results. Sorting is necessary because we use --echo-map, where mapped elements can be returned out of order.
The file answer.bed will contain unique elements from map.bed that overlap elements from a 10kb-padded version of reference.bed.
Here's another approach that uses bedops --range:
$ bedops --merge reference.bed | bedops --range 10000 - | bedops --element-of 1 map.bed - > answer.bed
The file answer.bed will contain unique elements from map.bed that overlap elements from a 10kb-padded version of reference.bed. Adjust padding, as needed.
Merging the reference intervals before padding takes care of overlaps, so there is no need to filter duplicates and re-sort afterwards. I'd expect this to run faster than the bedmap --range version, though I haven't timed it.
I didn't know that one, thanks!