Split bed file by unique first three columns
6.6 years ago
graeme.thorn ▴ 100

I have a long BED-like file generated by bedtools intersect which has the following form:

chr1 1 100     chr1 1 10 0.9
chr1 1 100     chr1 11 20 0.5
chr1 1 100     chr1 21 30 0.92
....
chr1 1 100     chr1 91 100 0.3
chr1 101 200   chr1 101 110 0.3
....
chr1 101 200   chr1 191 200 0.1
chr1 201 451   chr1 201 210 0
....
etc

with multiple rows per region, where a region is defined by the first three columns. I would like to split this into one file per region, so in this case there would be chr1.1-100.bed, chr1.101-200.bed, and chr1.201-451.bed, each containing the lines that have that region as their first three columns. Is there a quick way of doing this? I could knock something up in R (I'm not too familiar with Python), but there may be a faster way.

bedtools bed split • 2.0k views
6.6 years ago

Assuming you have BEDOPS (bedops and sort-bed), GNU awk, and bash:

$ sort-bed regions.unsorted.bed > regions.bed
$ awk '{ k=$1"."$2"."$3; if (!(a[k])) { print $0; a[k]+=1; } }' regions.bed | cut -f1-3 | sort-bed - > unique.bed

Then:

$ while read -r region; do \
    key=`echo ${region} | tr " " "."`; \
    echo -e "${region}" | bedops -e 100% regions.bed - > ${key}.bed; \
  done < unique.bed

This creates three files, one for each unique interval in your sample input:

$ more chr1.1.100.bed 
chr1    1       100     chr1    1       10      0.9
chr1    1       100     chr1    11      20      0.5
chr1    1       100     chr1    21      30      0.92
chr1    1       100     chr1    91      100     0.3

$ more chr1.101.200.bed
chr1    101     200     chr1    101     110     0.3
chr1    101     200     chr1    191     200     0.1

$ more chr1.201.451.bed 
chr1    201     451     chr1    201     210     0

Tools that natively support Unix streams make this easy to do.
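
For comparison, here is a minimal single-pass sketch using awk alone, not the BEDOPS approach above. It assumes whitespace-separated columns and names the files the way the question asks (chr1.1-100.bed). The scratch directory and the sample lines are made up for illustration.

```shell
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical tab-separated sample modelled on the question's input.
printf 'chr1\t1\t100\tchr1\t1\t10\t0.9\n'      >  regions.bed
printf 'chr1\t1\t100\tchr1\t11\t20\t0.5\n'     >> regions.bed
printf 'chr1\t101\t200\tchr1\t101\t110\t0.3\n' >> regions.bed
printf 'chr1\t201\t451\tchr1\t201\t210\t0\n'   >> regions.bed

# One pass: build the output name from columns 1-3 and append each line to it.
# When the region changes, the previous file is closed to limit open handles;
# appending (>>) keeps this correct even if a region reappears later.
awk '{
    out = $1 "." $2 "-" $3 ".bed"
    if (out != prev) { if (prev != "") close(prev); prev = out }
    print >> out
}' regions.bed
```

Because the output is appended, rerunning this in a directory that already holds the per-region files would duplicate lines, so clear old output first.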

6.6 years ago
sed 's/\t/_/;s/\t/_/' input.bed | while IFS= read -r F; do B=$(echo "$F" | cut -f1); echo "$F" | cut -f2- >> "${B}.bed"; done
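
Note that the one-liner above writes only the columns after the region to each file. If the region columns should stay in the output, as the question asks, a variant can be sketched with plain `read` splitting on tabs, without sed or cut. This is a hedged sketch, not the answer's method: the sample lines and the scratch directory are made up, and it assumes tab-separated input with non-empty first three columns.

```shell
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical tab-separated sample modelled on the question's input.
printf 'chr1\t1\t100\tchr1\t1\t10\t0.9\n'    >  input.bed
printf 'chr1\t1\t100\tchr1\t11\t20\t0.5\n'   >> input.bed
printf 'chr1\t201\t451\tchr1\t201\t210\t0\n' >> input.bed

# Split each line on tabs: the first three fields name the output file,
# and "rest" keeps the remaining columns with their tabs intact.
tab=$(printf '\t')
while IFS="$tab" read -r chrom start end rest; do
    printf '%s\t%s\t%s\t%s\n' "$chrom" "$start" "$end" "$rest" \
        >> "${chrom}_${start}_${end}.bed"
done < input.bed
```

A shell loop like this is slow on very large files, so it is mainly useful when awk is not an option.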