Question: Split bed file by unique first three columns
graeme.thorn (London, United Kingdom) wrote, 23 months ago:

I have a long BED-like file generated by bedtools intersect which has the following form:

chr1 1 100     chr1 1 10 0.9
chr1 1 100     chr1 11 20 0.5
chr1 1 100     chr1 21 30 0.92
....
chr1 1 100     chr1 91 100 0.3
chr1 101 200   chr1 101 110 0.3
....
chr1 101 200   chr1 191 200 0.1
chr1 201 451   chr1 201 210 0
....
etc

There are multiple rows per region, where a region is defined by the first three columns. I would like to split this into one file per region, so in this case there would be chr1.1-100.bed, chr1.101-200.bed, and chr1.201-451.bed, each containing the lines that have that region as their first three columns. Is there a quick way of doing this? I could knock something up in R (I'm not too familiar with Python), but there may be a faster way.

Alex Reynolds (Seattle, WA USA) wrote, 23 months ago:

Assuming you have the BEDOPS tools bedops and sort-bed, GNU awk, and bash, first sort the input and extract the unique regions:

$ sort-bed regions.unsorted.bed > regions.bed
$ awk '{ k=$1"."$2"."$3; if (!(a[k])) { print $0; a[k]+=1; } }' regions.bed | cut -f1-3 | sort-bed - > unique.bed

Then:

$ while read -r region; do \
    key=$(echo "${region}" | tr '\t' '.'); \
    echo "${region}" | bedops -e 100% regions.bed - > "${key}.bed"; \
  done < unique.bed

This creates three files, one per unique interval in your sample input:

$ more chr1.1.100.bed 
chr1    1       100     chr1    1       10      0.9
chr1    1       100     chr1    11      20      0.5
chr1    1       100     chr1    21      30      0.92
chr1    1       100     chr1    91      100     0.3

$ more chr1.101.200.bed
chr1    101     200     chr1    101     110     0.3
chr1    101     200     chr1    191     200     0.1

$ more chr1.201.451.bed 
chr1    201     451     chr1    201     210     0

Tools that natively support Unix streams make this easy to do.
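
As a possible alternative when BEDOPS is not installed, the same split can be sketched in a single awk pass. This is only a sketch under assumptions: split_by_region is a hypothetical helper name, the fields are whitespace-delimited, and the output names follow the chr1.1-100.bed pattern from the question rather than the chr1.1.100.bed names used above.

```shell
#!/usr/bin/env bash
# Sketch: one output file per unique (chrom, start, end) triple.
# split_by_region is a hypothetical helper, not part of BEDOPS or bedtools.
split_by_region() {
    local input="$1" outdir="${2:-.}"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
    {
        # Build the file name from the first three columns, e.g. chr1.1-100.bed
        out = dir "/" $1 "." $2 "-" $3 ".bed"
        # Close the previous file when the region changes, so that only one
        # output file is open at a time (avoids open-file limits).
        if (out != prev) {
            if (prev != "") close(prev)
            prev = out
        }
        # Append the unchanged input line; ">>" keeps earlier lines if a
        # region reappears later in unsorted input.
        print >> out
    }' "$input"
}
```

Calling split_by_region regions.bed out would write the per-region files into out/. Because the sketch appends, files left over from an earlier run should be removed first.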

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 23 months ago:
sed 's/\t/_/;s/\t/_/' input.bed | while IFS= read -r F; do B=$(echo "$F" | cut -f1); echo "$F" | cut -f2- >> "${B}.bed"; done
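
Note that the loop above writes only columns 4 onward to each file, since the first three columns are merged into the file name. If the region columns should stay in the output, as the question asks, a pure-bash variant might look like the sketch below. It assumes tab-delimited input with no empty fields, and split_keep_columns is a hypothetical name chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split by the first three columns and keep them in each output line.
# split_keep_columns is a hypothetical helper; input is assumed tab-delimited.
split_keep_columns() {
    local input="$1" outdir="${2:-.}"
    local chrom start end rest
    mkdir -p "$outdir"
    # Read three region fields; everything after them lands in "rest".
    # The "|| [ -n ... ]" clause handles a final line without a newline.
    while IFS=$'\t' read -r chrom start end rest || [ -n "$chrom" ]; do
        printf '%s\t%s\t%s\t%s\n' "$chrom" "$start" "$end" "$rest" \
            >> "$outdir/${chrom}_${start}_${end}.bed"
    done < "$input"
}
```

For example, split_keep_columns input.bed out would produce out/chr1_1_100.bed and so on. A shell loop like this is likely to be slower than awk on long files, so it is best suited to modest inputs.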