Question

Split bed file by unique first three columns

6.6 years ago

graeme.thorn ▴ 100

I have a long BED-like file generated by bedtools intersect which has the following form:

chr1 1 100     chr1 1 10 0.9
chr1 1 100     chr1 11 20 0.5
chr1 1 100     chr1 21 30 0.92
....
chr1 1 100     chr1 91 100 0.3
chr1 101 200   chr1 101 110 0.3
....
chr1 101 200   chr1 191 200 0.1
chr1 201 451   chr1 201 210 0
....
etc

with multiple rows per region demarcated by the first three columns. I would like to split this into individual files per region (first three columns), so in this case, there'd be a file for chr1.1-100.bed, a file for chr1.101-200.bed, one for chr1.201-451.bed, each consisting of those lines with the region as the first three columns. Is there a quick way of doing this? I could knock something up in R (as I'm not too familiar with python), but there may be a faster way.

bedtools bed split • 2.0k views

6.6 years ago

Pierre Lindenbaum 161k

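# Assumes GNU sed and tab-delimited input; writes columns 4 onward to <chrom>_<start>_<end>.bed
sed 's/\t/_/;s/\t/_/' input.bed | while IFS= read -r F; do B=$(printf '%s\n' "$F" | cut -f1); printf '%s\n' "$F" | cut -f2- >> "${B}.bed"; done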

score 3 · Accepted Answer · 2017-09-30

Assuming BEDOPS bedops and sort-bed, GNU awk, and bash:

$ sort-bed regions.unsorted.bed > regions.bed
$ awk '{ k=$1"."$2"."$3; if (!(a[k])) { print $0; a[k]+=1; } }' regions.bed | cut -f1-3 | sort-bed - > unique.bed

Then:

$ while read -r region; do \
    key=`echo ${region} | tr " " "."`; \
    echo -e "${region}" | bedops -e 100% regions.bed - > ${key}.bed; \
  done < unique.bed

This creates three files, one for each unique interval in your sample input:

$ more chr1.1.100.bed 
chr1    1       100     chr1    1       10      0.9
chr1    1       100     chr1    11      20      0.5
chr1    1       100     chr1    21      30      0.92
chr1    1       100     chr1    91      100     0.3

$ more chr1.101.200.bed
chr1    101     200     chr1    101     110     0.3
chr1    101     200     chr1    191     200     0.1

$ more chr1.201.451.bed 
chr1    201     451     chr1    201     210     0

Tools that natively support Unix streams make this easy to do.
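
For comparison, here is a minimal sketch of a single-pass alternative that needs only awk and a POSIX shell. It is not part of bedtools or BEDOPS: the function name `split_by_region` and its arguments are hypothetical, chosen for illustration. It assumes chromosome names are safe to use in file names, and it names each file `chrom.start-end.bed` as in the question. Because it appends, any files left over from an earlier run should be removed first.

```shell
# Sketch only: split a BED-like file into one file per unique (chrom, start, end).
# split_by_region is a hypothetical helper name, not a bedtools/BEDOPS command.
split_by_region() {
  input="$1"          # BED-like file, tab- or whitespace-delimited
  outdir="${2:-.}"    # directory for the per-region files (defaults to the current one)
  awk -v dir="$outdir" '
    {
      # Build the output name from the first three columns, e.g. chr1.1-100.bed
      out = dir "/" $1 "." $2 "-" $3 ".bed"
      if (out != prev) {
        # Region changed: close the previous file so we never hold many handles open
        if (prev != "") close(prev)
        prev = out
      }
      # Append the whole line, so rows of a region that are not adjacent still end up together
      print >> out
    }
  ' "$input"
}
```

It could be called as `split_by_region input.bed outdir`, where `outdir` already exists. Unlike the BEDOPS approach, this does not require sorted input, but it also does no interval arithmetic: rows are grouped purely by matching the text of the first three columns.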