I have a large set of regions, many of which overlap. Where they overlap, I am trying to subset them so that I am left with the minimal number of non-overlapping regions.
I am not sure of the best way to go about this. I thought maybe I could run bedtools merge and then select the largest of the merged regions? I could then run bedtools intersect -v on the subset of regions and add the two together?
Since this could bring in regions which themselves overlap, I would then need to iterate on this until I got what I wanted.
Also, selecting the largest might not give the best path.
Is there a more elegant way to do this?
Example data:
chr1 15627 2015626 TRF_197022
chr1 15628 43383 TRF_197021
chr1 43845 44514 TRF_197027
chr1 44503 355335 TRF_197029
chr1 355339 356932 TRF_197079
chr1 356933 450858 TRF_197081
chr1 450888 455989 TRF_197096
chr1 450888 455989 TRF_197095
chr1 458068 458111 TRF_197101
chr1 458068 458111 TRF_197100
chr1 458253 458301 TRF_197102
chr1 458798 458823 TRF_197103
chr1 458920 458947 TRF_197104
chr1 459055 459093 TRF_197105
chr1 468536 519257 TRF_197110
chr1 519259 2507432 TRF_197117
chr1 598171 599043 TRF_197129
chr1 639285 2639284 TRF_197226
chr1 1686777 1686809 TRF_197185
chr1 1687030 1687057 TRF_197186
chr1 1687706 1687770 TRF_197188
chr1 1687717 1687770 TRF_197187
chr1 1687828 1687861 TRF_197190
chr1 1734806 1734853 TRF_197193
chr1 1736459 1736506 TRF_197195
chr1 2507429 3300067 TRF_197221
chr1 2675012 2676396 TRF_197228
chr1 2676387 3320976 TRF_197230
It's not clear how you choose whether or not to merge some regions.
I do not want to merge them. I want to filter them so that I retain a subset that covers the most area, but any one position has at most one region annotated.
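Stated this way, the goal looks like the classic weighted interval scheduling problem, with each region's length as its weight. Below is a minimal Python sketch of the usual dynamic-programming solution, not a bedtools feature. It assumes the regions are already parsed into (chrom, start, end, name) tuples with BED-style half-open coordinates, so a region ending at 100 does not overlap one starting at 100. The function name max_coverage_subset is made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def max_coverage_subset(regions):
    """Pick non-overlapping regions that maximise the total bases covered.

    regions: iterable of (chrom, start, end, name) tuples, half-open coordinates.
    Returns the kept regions, ordered by chromosome and position.
    """
    by_chrom = defaultdict(list)
    for region in regions:
        by_chrom[region[0]].append(region)

    kept = []
    for chrom in sorted(by_chrom):
        # Sort by end coordinate so earlier intervals never end after later ones.
        ivs = sorted(by_chrom[chrom], key=lambda r: (r[2], r[1]))
        ends = [r[2] for r in ivs]

        # best[i] = maximum covered bases using only the first i intervals.
        best = [0] * (len(ivs) + 1)
        # prev[i] = how many of the earlier intervals end at or before ivs[i] starts.
        prev = [0] * len(ivs)
        for i, (_, start, end, _name) in enumerate(ivs):
            prev[i] = bisect_right(ends, start, 0, i)
            best[i + 1] = max(best[i], best[prev[i]] + (end - start))

        # Walk back through the table to recover which intervals were chosen.
        chosen = []
        i = len(ivs)
        while i > 0:
            _, start, end, _name = ivs[i - 1]
            if best[prev[i - 1]] + (end - start) > best[i - 1]:
                chosen.append(ivs[i - 1])
                i = prev[i - 1]
            else:
                i -= 1
        kept.extend(reversed(chosen))
    return kept
```

Because the table considers every compatible combination, this avoids the iterate-and-hope loop and does not rely on the largest region always being the right pick. When several subsets tie for total coverage, the sketch returns just one of them.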
I mentioned bedtools merge because, as well as creating a new large region which encompasses all the previous regions, it can also return a list of the previous regions subsumed into the new merged region. This serves as a mechanism for identifying groups of overlapping original regions.
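The same grouping can be sketched directly in Python, which may be convenient if each group is going to be processed on its own afterwards. This is a rough stand-in for what the merged output with a collapsed name column provides, not a reimplementation of bedtools. It assumes (chrom, start, end, name) tuples with half-open coordinates and treats only strictly overlapping regions as connected, whereas bedtools merge may also join book-ended regions depending on its settings. The function name overlap_groups is hypothetical.

```python
def overlap_groups(regions):
    """Group regions that are connected, directly or transitively, by overlap.

    regions: iterable of (chrom, start, end, name) tuples, half-open coordinates.
    Returns a list of groups, each a list of the original tuples.
    """
    groups = []
    current = []
    current_chrom = None
    current_end = None

    for region in sorted(regions, key=lambda r: (r[0], r[1], r[2])):
        chrom, start, end = region[0], region[1], region[2]
        if current and chrom == current_chrom and start < current_end:
            # Overlaps the running group: extend it and its right edge.
            current.append(region)
            current_end = max(current_end, end)
        else:
            # No overlap (or a new chromosome): close the group, start another.
            if current:
                groups.append(current)
            current = [region]
            current_chrom = chrom
            current_end = end

    if current:
        groups.append(current)
    return groups
```

Groups containing a single region need no filtering at all, so only the multi-region groups require a choice between competing regions.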