Create genome-unique intervals from overlapping records in a bed file
Entering edit mode
12 months ago
c0tton • 0

Let there exist a bed file, a, with 3 overlapping records:


chr1      1      20      s1      1      +      
chr1      5      20      s2      1      +      
chr1      10    20      s3      1      +

I want to write a function which would parse a and return a bed-file, bm containing genome-unique windows along with a list of bed-records which span those windows, i.e.


chr1      1      5      s1      1      +      
chr1      5      10      s1,s2,s3      1      +      
chr1      10      20      s1, s2, s3      1      +      

The goal of the function is to produce genome-unique bed records (essentially an index), from which combinations of records can be used to produce each of the original bed records.

I wouldn't know where to start with something like this. Any suggestions are greatly appreciated.

bedtools ranges BED genomic • 715 views
Entering edit mode
12 months ago

You could use BEDOPS bedops --partition to easily calculate disjoint elements from a set of overlapping elements.

Then you could pass, or pipe the result to bedmap --echo --echo-map-id to get the partitioned intervals and the IDs of elements in a.bed that overlap (or "map" to) the partitioned intervals:

$ bedops --partition a.bed | bedmap --echo --echo-map-id --echo-map --delim '\t' - a.bed | cut -f1-4,9,10 > answer.bed

It's not entirely clear from your example how you want to deal with cases where overlapping elements in a have differences in scores and/or strands, but the above is a simple and fast way to get an answer.

If you don't care about score and strand, the following suffices:

$ bedops --partition a.bed | bedmap --echo --echo-map-id --delim '\t' - a.bed > answer.bed

Your example input is sorted, but your real input might not be. If your starting intervals are unsorted or you don't know their sort order, you can sort with sort-bed to prepare input to use for fast set operations:

$ sort-bed a.unsorted.bed > a.bed

This generates output equivalent to sort -k1,1 -k2,2n -k3,3n, but sort-bed is faster for sorting BED files.

Entering edit mode
12 months ago
 awk '{B=int($2);E=int($3);while(B<E) {printf("%s\t%d\t%d\t%s\n",$1,B,B+1,$4);B++}}' input.bed |\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools groupby -g 1,2,3 -c 4 -o distinct |\
awk '{C=$1;B=int($2);E=int($3);N=$4; if(NR>1 && (C!=PC || B!=PE || N!=PN)) {printf("%s\t%d\t%d\t%s\n",PC,PB,PE,PN); PC=C;PE=E;PB=B;PN=N;} else {PC=C;PE=E;PN=N;} } END {printf("%s\t%d\t%d\t%s\n",PC,PB,PE,PN);}' 

chr1    0   5   s1
chr1    5   10  s1,s2
chr1    10  20  s1,s2,s3
Entering edit mode
12 months ago
c0tton • 0

Thank you both for your replies.

Here's the final code I used in case anyone else wants to achieve something similar:

# split the input bed by strand 
cat $inputBed | awk '($6 == "+")' > $
cat $inputBed | awk '($6 == "-")' > $inputBed.minus

# run Alex's code for each stand
time bedops --partition $ | bedmap --echo --echo-map-id --echo-map --delim '\t' - $| cut -f1-3 | awk '{print $1, $2, $3, "node_+_"NR, "1", "+"}' OFS="\t"  >
time bedops --partition $inputBed.minus | bedmap --echo --echo-map-id --echo-map --delim '\t' - $| cut -f1-3 | awk '{print $1, $2, $3, "node_-_"NR, "1", "-"}' OFS="\t" > nodes.minus.bed

# filter out records where start is less than end 
cat nodes.minus.bed | awk '($3 > $2){print $0}' | bedtools sort -i - > nodes.merged.bed

Thank you both so much for your help!


Login before adding your answer.

Traffic: 1624 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6