Question

Finding max number of high scoring, non-overlapping BED records

0

Entering edit mode

8.6 years ago

mparker2 ▴ 20

I have a large BED file with many scored records which tend to be clustered and overlap. I want to filter the file to find the maximum number of non-overlapping records with high scores. I cannot use something like bedtools merge because I want the original record positions, not merged positions of all the clustered records, and because bedtools merge groups together nearby records which do not overlap but are bridged by a third record. Furthermore bedtools does not seem to have an option to disable grouping of 'bookended' records (those which touch but do not overlap), which I want to keep separate.

The best idea I have come up with so far is to use bedtools cluster to group records, then filter for the highest scoring record per cluster, then use intersect -v to find records from the original file which do not overlap specifically with the highest scoring records, then again cluster these 'secondary' records and pick the highest scoring per cluster again, and add these to the primary high scoring records. I guess it might take many rounds of doing this to end up with the optimum number of high scoring non-overlapping records. So my question is, does anyone know a better way?

Edit:

An example from my BED file might look like this:

Chr1 102 115 motif 65 +
Chr1 104 118 motif 50 +
Chr1 110 125 motif 40 +
Chr1 117 130 motif 60 +

Because they all indirectly overlap, bedtools cluster will group all of these together, but the second and last on their own do not actually overlap, but are just bridged by the two in the middle.

What I've come up with so far is this, its pretty slow because the file is huge and the clusters are actually a lot bigger than my example:

#!/bin/bash
bedtools cluster -i CHO-K1-GQs_scored.bed |
python max_of_cluster.py |
sort -k1,1 -k2,2n > CHO-K1-GQs_scored.max.bed

while true; do
    bedtools intersect -v -a CHO-K1-GQs_scored.bed -b CHO-K1-GQs_scored.max.bed |
    bedtools cluster -i - |
    python max_of_cluster.py |
    sort -k1,1 -k2,2n > CHO-K1-GQs_scored.temp.bed
    if [ -s CHO-K1-GQs_scored.temp.bed ]; then
        break
    fi
    
    cat CHO-K1-GQs_scored.max.bed CHO-K1-GQs_scored.temp.bed | sort -k1,1 -k2,2n > CHO-K1-GQs_scored.maxtemp.bed
    mv CHO-K1-GQs_scored.maxtemp.bed CHO-K1-GQs_scored.max.bed

done

So far it's done 15 loops and is still returning a lot of new records per loop :/

bed bedtools • 2.7k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by mparker2 ▴ 20

score 0 · Answer 1 · 2015-09-30

You can tale advantage of the "-o" option of bedtools merge allowing you to perform several operations on the merged features, like reporting their start and end so you can keep the original information. You can specificy many operation on the same column too. For instance, you can count the nb of merged elements allowing you to get non-overlapping ones (with count =1) and many other things.

If you can share an example of your bed file with a example case, we can help you.

Ram · Answer 2 · 2015-09-30

Maybe you could do something like this to find all non-overlapping elements in n files:

$ bedops -u a.bed b.bed ... n.bed | bedmap --count --echo --delim '\t' - | awk '$1==1' | cut -f2- > disjoint.bed

Then score the elements in disjoint.bed over a series of sliding windows, which will identify high-scoring ("enriched") regions that might bear closer inspection. This shows where high scoring elements accumulate and may reduce how much space to search through for further optimization.

Say you are working with genome build hg19, which has the following extents:

You could generate a 20 kbase window, sliding over every 10 kb, with the --chop and --stagger options in bedops against the extents file - sweeping the windows over each chromosome in the genome - piping those windows to bedmap to score:

$ bedops --chop 20000 --stagger 10000 hg19.extents.bed | bedmap --echo --max --delim '\t' - disjoint.bed > scored_windows.bed

The scored_windows.bed file contains each window, and the maximum score of any element in disjoint.bed that overlaps that window. If maximum score isn't the criteria, there are other statistical operations that may be useful: --sum, --mean, --median, --tmean (trimmed mean) etc.

You might investigate different scales of sliding windows, plotting where and how signal accumulates at those different scales.

You might subsequently look closer into highly-enriched regions with bedmap --echo-map and bedmap --echo-map-size or use bedmap --count to quantify the number of disjoint regions over those enriched windows, etc.