Subset overlapping regions to find the minimal number of non-overlapping regions
7 weeks ago
Geoffrey • 0

I have a large set of regions, many of which overlap. Where they overlap, I am trying to subset them so that I am left with the minimal number of non-overlapping regions.

I am not sure of the best way to go about this. I thought maybe I could run bedtools merge and then select the largest region within each merged region. I could then run bedtools intersect -v with the remaining regions against that selection and add the two sets together. Since this could bring in regions which themselves overlap, I would then need to iterate until I got what I wanted. Also, selecting the largest region first might not give the best result.
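
To make the last concern concrete, here is a minimal Python sketch of a largest-first greedy filter on toy intervals. It is a simplified stand-in for the merge / intersect -v loop, not that exact procedure; the function names and the three toy intervals are made up for illustration, and coordinates are assumed to be half-open as in BED.

```python
def overlaps(a, b):
    # Half-open intervals (start, end) overlap if each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def greedy_largest_first(regions):
    # Keep the longest region first, then any region that does not
    # overlap something already kept.
    kept = []
    for r in sorted(regions, key=lambda r: r[1] - r[0], reverse=True):
        if not any(overlaps(r, k) for k in kept):
            kept.append(r)
    return sorted(kept)

# Hypothetical toy data: the middle region is the longest but blocks both neighbours.
regions = [(0, 10), (9, 21), (20, 30)]
kept = greedy_largest_first(regions)
covered = sum(end - start for start, end in kept)
print(kept, covered)
```

Here the greedy pass keeps only the middle interval (12 bp), while keeping the two outer intervals would cover 20 bp without any overlap.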

Is there a more elegant way to do this?

Region diagram

Example data:

chr1    15627   2015626 TRF_197022
chr1    15628   43383   TRF_197021
chr1    43845   44514   TRF_197027
chr1    44503   355335  TRF_197029
chr1    355339  356932  TRF_197079
chr1    356933  450858  TRF_197081
chr1    450888  455989  TRF_197096
chr1    450888  455989  TRF_197095
chr1    458068  458111  TRF_197101
chr1    458068  458111  TRF_197100
chr1    458253  458301  TRF_197102
chr1    458798  458823  TRF_197103
chr1    458920  458947  TRF_197104
chr1    459055  459093  TRF_197105
chr1    468536  519257  TRF_197110
chr1    519259  2507432 TRF_197117
chr1    598171  599043  TRF_197129
chr1    639285  2639284 TRF_197226
chr1    1686777 1686809 TRF_197185
chr1    1687030 1687057 TRF_197186
chr1    1687706 1687770 TRF_197188
chr1    1687717 1687770 TRF_197187
chr1    1687828 1687861 TRF_197190
chr1    1734806 1734853 TRF_197193
chr1    1736459 1736506 TRF_197195
chr1    2507429 3300067 TRF_197221
chr1    2675012 2676396 TRF_197228
chr1    2676387 3320976 TRF_197230
overlap bedtools intersect
It's not clear how you decide whether or not to merge some regions.

I do not want to merge them. I want to filter them so that I retain a subset that covers the most area, while any one position has at most one region annotated.
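
Stated this way, the problem looks like weighted interval scheduling, with each region weighted by its length. A minimal Python sketch of the standard dynamic programme follows. It assumes half-open BED coordinates and a single chromosome (the input would need to be split per chromosome first); `max_coverage_subset` and the toy regions are hypothetical names and data for illustration, not anything provided by bedtools.

```python
from bisect import bisect_right

def max_coverage_subset(regions):
    """regions: list of (start, end, name) on one chromosome, half-open.
    Returns (total_bp_covered, chosen_regions) with no two chosen regions overlapping."""
    regs = sorted(regions, key=lambda r: r[1])   # sort by end coordinate
    ends = [r[1] for r in regs]
    n = len(regs)
    best = [0] * (n + 1)       # best[i]: max bp covered using the first i regions
    take = [False] * (n + 1)   # take[i]: whether region i is used in best[i]
    prev = [0] * (n + 1)       # prev[i]: how many earlier regions end at or before region i starts
    for i in range(1, n + 1):
        start, end, _ = regs[i - 1]
        p = bisect_right(ends, start, 0, i - 1)
        with_i = best[p] + (end - start)
        if with_i > best[i - 1]:
            best[i], take[i], prev[i] = with_i, True, p
        else:
            best[i] = best[i - 1]
    # Walk back through the table to recover the chosen regions.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(regs[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen

# Toy example (made-up coordinates, not from the data above).
total, chosen = max_coverage_subset([(0, 10, "A"), (9, 21, "B"), (20, 30, "C")])
print(total, [name for _, _, name in chosen])
```

The sort plus binary search makes this O(n log n), so it should not need the iterative merge-and-filter loop. Ties between equally good subsets are broken arbitrarily.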

I mentioned bedtools merge because, as well as creating a new large region which encompasses all the previous regions, it can also return a list of the original regions subsumed into the new merged region. This serves as a mechanism to identify groups of overlapping original regions.
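
The same grouping can be sketched directly in Python, which may be easier to combine with a later filtering step. This is a minimal sketch under assumptions: one chromosome, half-open coordinates, and a hypothetical function name `cluster_regions`. Unlike the default behaviour of bedtools merge, as I understand it, book-ended regions (one ending exactly where the next starts) are kept in separate groups here.

```python
def cluster_regions(regions):
    """Group (start, end, name) regions on one chromosome into clusters
    of regions that overlap directly or through a chain of overlaps."""
    clusters = []
    cluster_end = None   # furthest end coordinate seen in the current cluster
    for region in sorted(regions, key=lambda r: (r[0], r[1])):
        start, end, _ = region
        if cluster_end is None or start >= cluster_end:
            # No overlap with the current cluster: start a new one.
            clusters.append([region])
            cluster_end = end
        else:
            clusters[-1].append(region)
            cluster_end = max(cluster_end, end)
    return clusters

# Toy example (made-up coordinates).
clusters = cluster_regions([(0, 10, "a"), (5, 8, "b"), (10, 20, "c"), (30, 40, "d")])
print([[name for _, _, name in c] for c in clusters])
```

Each cluster can then be filtered independently, since regions in different clusters never overlap.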


