I have a massive file with a lot of regions in bed format. Many of them are overlapping either partially or entirely. I would like to merge these regions, together with their names, in a way that I can see exactly what region (ID'd by the name) contributed to the result.
Using the standard mergeBed doesn't do what I need it to. I can run mergeBed - i bigfile.bed -c1,12 -o count,collapse to get the clustered regions, but that is not what I want. Due to the data I have, there are partial overlaps in many many places and this leads to gargantuan merged intervals that do not specify where each name contributes.
Using this input -
chr1 150 158 A chr1 155 165 B chr1 160 170 C chr1 177 190 D
With the mergeBed I listed above would result in the output -
chr1 150 170 A,B,C 3 chr1 177 190 D, 1
This is unfortunately not what I need.
Ideally, merging the above input would give me the following output (with the 4th column being the name of the region and the 5th being the number of input regions used for calculating the output region) -
chr1 150 154 A 1 chr1 155 158 A,B 2 chr1 159 159 B, 1 chr1 160 165 B,C 2 chr1 166 170 C 1 chr1 177 190 D 1
Note that the regions here are divided by the contributing regions, and you can easily see the original input regions in this result. It is also possible to deduce exactly what region is contributing, such that if you SNPs fall into one of these, you know exactly what is being directly impacted.
Is there any way to do this with existing software?
Thank you in advance