I have a bedfile with multiple mapped sequences like below:
chr1    100     200     region1    +
chr1    180     300     region2    +
chr1    300     350     region3    +
chr1    320     400     region4    +
chr1    330     410     region5    +
I am using bedtools merge to merge sequences that overlap - however, I really need to set a minimum overlap cut-off. Bedtools merge will overlap bookended sequences, and sequences overlapping by just one basepair, and there is no way to change this.
E.g. regions 1 and 2 overlap by only 20bp, and in this case I would not like them merged. However, regions 4 and 5 mostly overlap, and I would like these sequences to be merged. Its also really important that I am able to retain the ID and strand information in the resulting merged bed entry, as well as enforcing strandedness when merging.
Does anyone know of a program or method to do this? I have also tried bedOPS with no success. I am new to Python programming so not quite sure how I would be able to code this myself!
Not sure if you wanting to do like this-
Note that, in place of
sort-bed - | uniq -, you can pass the--uniqueoption tosort-bed.The input is not strictly a BED6 file, but you can strand-separate it by way of
awk:Repeat for reverse-stranded elements:
May as well sort, if the sort order of
input.bedis unknown.Then run BEDOPS
bedmapon each file, as usual.You can use
bedops --everythingto take the multiset union of the twobedmapresults:The file
answer.bedwill be sorted correctly.To preserve IDs in the fourth column, you can use the
--echo-map-idor--echo-map-id-uniqoperator withbedmap.