Looking for a good way to take a set of intervals and print out an interval set (BED file) that represents the regions just upstream and downstream of every interval in the original file, let's say 10 kb upstream and downstream. Any help appreciated, thanks!
As with most interval operations, bedtools has a command for it:
https://bedtools.readthedocs.io/en/latest/content/tools/flank.html
awk -F '\t' '{X=10000; B=int($2);E=int($3);printf("%s\t%d\t%d\n%s\t%d\t%d\n",$1,B-X<0?0:B-X,B,$1,E,E+X);}'
If you are using R, you can do it with the flank function in GenomicRanges. It takes into account the chromosome lengths, if present.
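For anyone who prefers Python, here is a minimal sketch of the same flank logic as the awk one-liner above, with optional clamping to chromosome lengths. The function names (`flanks`, `flank_bed_lines`) and the `chrom_sizes` dictionary are hypothetical, not part of any library. It assumes 0-based, half-open BED coordinates and ignores strand, so "upstream" simply means the lower-coordinate side, as in the awk version. Unlike the awk version, it skips zero-length flanks (for example when an interval starts at position 0).

```python
def flanks(chrom, start, end, size=10000, chrom_sizes=None):
    """Return the flanking regions of one BED interval as (chrom, start, end) tuples.

    chrom_sizes is an optional dict mapping chromosome name -> length
    (hypothetical input, e.g. parsed from a two-column genome file).
    """
    out = []

    # Left flank: clamp at 0 so we never emit negative coordinates.
    left_start = max(0, start - size)
    if left_start < start:
        out.append((chrom, left_start, start))

    # Right flank: clamp at the chromosome end when the length is known.
    right_end = end + size
    if chrom_sizes is not None and chrom in chrom_sizes:
        right_end = min(right_end, chrom_sizes[chrom])
    if end < right_end:
        out.append((chrom, end, right_end))

    return out


def flank_bed_lines(lines, size=10000, chrom_sizes=None):
    """Yield tab-separated flank records for an iterable of BED lines."""
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        for c, s, e in flanks(chrom, start, end, size, chrom_sizes):
            yield f"{c}\t{s}\t{e}"
```

To use it on a file, pass an open file handle to `flank_bed_lines` and print each yielded line.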
Yep, just use BEDOPS bedmap --range to map padded elements:
$ bedmap --skip-unmapped --echo-map --range 10000 reference.bed map.bed | awk '(!a[$0]++)' | sort-bed - > answer.bed
We use awk to strip duplicates from the unsorted results. Sorting is necessary because we use --echo-map, where mapped elements can be returned out of order.
The file answer.bed will contain unique elements from map.bed that overlap elements from a 10kb-padded version of reference.bed.
Here's another approach that uses bedops --range:
$ bedops --merge reference.bed | bedops --range 10000 - | bedops --element-of 1 map.bed - > answer.bed
The file answer.bed will contain unique elements from map.bed that overlap elements from a 10kb-padded version of reference.bed. Adjust padding, as needed.
Merging the reference intervals before padding should handle overlaps, which avoids the need to filter duplicates and re-sort. So this should be faster than using bedmap --range, I think.
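To make the merge, pad, filter idea concrete, here is a rough Python sketch of the same three steps. It is only an illustration of the logic under stated assumptions, not how BEDOPS implements it: the function names are hypothetical, intervals are (chrom, start, end) tuples in 0-based half-open coordinates, "overlap" means at least one shared base, and the overlap check is a simple nested loop rather than an efficient sweep over sorted input. Output keeps the order of the map elements instead of sorting them.

```python
def merge(intervals):
    """Merge overlapping or touching intervals on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def pad(intervals, size=10000):
    """Extend each interval by `size` bases on both sides, clamping at 0."""
    return [(chrom, max(0, start - size), end + size)
            for chrom, start, end in intervals]


def elements_overlapping(map_elements, regions):
    """Return each map element once if it overlaps any region by >= 1 base."""
    return [
        (chrom, start, end)
        for chrom, start, end in map_elements
        if any(chrom == r_chrom and start < r_end and r_start < end
               for r_chrom, r_start, r_end in regions)
    ]


# Made-up example data, for illustration only.
reference = [("chr1", 100000, 101000), ("chr1", 100500, 102000)]
map_elements = [
    ("chr1", 95000, 95100),    # inside the padded region
    ("chr1", 50000, 50100),    # too far upstream
    ("chr1", 111999, 112500),  # overlaps the padded region by one base
    ("chr1", 112000, 112100),  # starts exactly at the padded end: no overlap
    ("chr2", 100000, 100100),  # different chromosome
]

padded = pad(merge(reference), size=10000)
answer = elements_overlapping(map_elements, padded)
```

Because each map element is tested once against the padded regions, it can only appear once in the result, which is why no separate duplicate-filtering step is needed here.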
I didn't know that one, thanks!