I have a large BED file with many scored records which tend to be clustered and overlap. I want to filter the file to find the maximum number of non-overlapping records with high scores. I cannot use something like bedtools merge because I want the original record positions, not merged positions of all the clustered records, and because bedtools merge groups together nearby records which do not overlap but are bridged by a third record. Furthermore bedtools does not seem to have an option to disable grouping of 'bookended' records (those which touch but do not overlap), which I want to keep separate.
The best idea I have come up with so far is to use bedtools cluster to group records, then filter for the highest scoring record per cluster, then use intersect -v to find records from the original file which do not overlap specifically with the highest scoring records, then again cluster these 'secondary' records and pick the highest scoring per cluster again, and add these to the primary high scoring records. I guess it might take many rounds of doing this to end up with the optimum number of high scoring non-overlapping records. So my question is, does anyone know a better way?
An example from my BED file might look like this:
Chr1 102 115 motif 65 + Chr1 104 118 motif 50 + Chr1 110 125 motif 40 + Chr1 117 130 motif 60 +
Because they all indirectly overlap, bedtools cluster will group all of these together, but the second and last on their own do not actually overlap, but are just bridged by the two in the middle.
What I've come up with so far is this, its pretty slow because the file is huge and the clusters are actually a lot bigger than my example:
#!/bin/bash bedtools cluster -i CHO-K1-GQs_scored.bed | python max_of_cluster.py | sort -k1,1 -k2,2n > CHO-K1-GQs_scored.max.bed while true; do bedtools intersect -v -a CHO-K1-GQs_scored.bed -b CHO-K1-GQs_scored.max.bed | bedtools cluster -i - | python max_of_cluster.py | sort -k1,1 -k2,2n > CHO-K1-GQs_scored.temp.bed if [ -s CHO-K1-GQs_scored.temp.bed ]; then break fi cat CHO-K1-GQs_scored.max.bed CHO-K1-GQs_scored.temp.bed | sort -k1,1 -k2,2n > CHO-K1-GQs_scored.maxtemp.bed mv CHO-K1-GQs_scored.maxtemp.bed CHO-K1-GQs_scored.max.bed done
So far it's done 15 loops and is still returning a lot of new records per loop :/