Question: Split bed file by unique first three columns
graeme.thorn wrote (3.2 years ago):

I have a long BED-like file generated by bedtools intersect which has the following form:

chr1 1 100     chr1 1 10 0.9
chr1 1 100     chr1 11 20 0.5
chr1 1 100     chr1 21 30 0.92
....
chr1 1 100     chr1 91 100 0.3
chr1 101 200   chr1 101 110 0.3
....
chr1 101 200   chr1 191 200 0.1
chr1 201 451   chr1 201 210 0
....
etc

with multiple rows per region demarcated by the first three columns. I would like to split this into individual files per region (first three columns), so in this case, there'd be a file for chr1.1-100.bed, a file for chr1.101-200.bed, one for chr1.201-451.bed, each consisting of those lines with the region as the first three columns. Is there a quick way of doing this? I could knock something up in R (as I'm not too familiar with python), but there may be a faster way.

Alex Reynolds wrote (3.2 years ago):

Assuming you have BEDOPS (bedops and sort-bed), GNU awk, and bash, and that the input file is tab-delimited:

$ sort-bed regions.unsorted.bed > regions.bed
$ awk '{ k=$1"."$2"."$3; if (!(a[k])) { print $0; a[k]+=1; } }' regions.bed | cut -f1-3 | sort-bed - > unique.bed

Then:

$ while read -r region; do \
    key=$(echo "${region}" | tr '\t' '.'); \
    echo "${region}" | bedops -e 100% regions.bed - > "${key}.bed"; \
  done < unique.bed

This creates three files, one for each unique interval in your sample input:

$ more chr1.1.100.bed 
chr1    1       100     chr1    1       10      0.9
chr1    1       100     chr1    11      20      0.5
chr1    1       100     chr1    21      30      0.92
chr1    1       100     chr1    91      100     0.3

$ more chr1.101.200.bed
chr1    101     200     chr1    101     110     0.3
chr1    101     200     chr1    191     200     0.1

$ more chr1.201.451.bed 
chr1    201     451     chr1    201     210     0

Tools that natively support Unix streams make this easy to do.

Pierre Lindenbaum wrote (3.2 years ago):

# Join columns 1-3 into one key (e.g. chr1_1_100), then append the remaining columns to <key>.bed
sed 's/\t/_/;s/\t/_/' input.bed | while IFS= read -r F; do B=$(echo "$F" | cut -f1); echo "$F" | cut -f2- >> "${B}.bed"; done
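
For comparison, the same split can be sketched as a single awk pass that writes each full row to a file named after its first three columns, using the chr1.1-100.bed naming asked for in the question. This is only a minimal sketch, not a method given by anyone in the thread: the file name regions.bed, the temporary directory, and the four sample rows are assumptions for illustration, and the input is assumed to be tab-delimited.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample input (tab-delimited), modelled on the question's example.
workdir=$(mktemp -d)
cd "$workdir"
printf 'chr1\t1\t100\tchr1\t1\t10\t0.9\n'      >  regions.bed
printf 'chr1\t1\t100\tchr1\t11\t20\t0.5\n'     >> regions.bed
printf 'chr1\t101\t200\tchr1\t101\t110\t0.3\n' >> regions.bed
printf 'chr1\t201\t451\tchr1\t201\t210\t0\n'   >> regions.bed

# One pass over the input: the output name is built from columns 1-3
# and the whole line is appended to that file.
awk -F'\t' '{
    out = $1 "." $2 "-" $3 ".bed"
    if (out != prev) {                 # region changed
        if (prev != "") close(prev)    # keep only one output file open
        if (!(out in seen)) {          # first sighting in this run: start empty
            printf "" > out
            close(out)
            seen[out] = 1
        }
        prev = out
    }
    print >> out
}' regions.bed
```

Closing the previous file whenever the region changes keeps at most one output file open at a time, which matters when there are many regions and a per-process limit on open files. Because rows are appended after an initial truncation, the sketch does not require the input to be sorted, although sorted input avoids reopening files repeatedly.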