Question: Split Bed File To Another Bed Files With Specific Cordinates
gravatar for filipzembol
6.4 years ago by
filipzembol110 wrote:

Dear all, I would like to split bed file to more files. The conditions are coordinates (from cooridnates 0-60000 (1st bed file), 60000 - 120000 (2nd bed file), 120000 - 180000 (3rd bed file), etc...)

for example: INPUT 1.bed:

                    chr1        0      1000
                    chr1   1500      3000
                    chr1  50000    60000
                    chr1  60500    80000
                    chr1  90000   100000
                    chr1 120001  120500


           1a.bed: chr1        0      1000
                      chr1   1500      3000
                      chr1  50000    60000

             1b.bed: chr1  60500    80000
                     chr1  90000   100000

              1c.bed: chr1  120001  120500

Thank you. Filip

bed split awk • 3.5k views
ADD COMMENTlink modified 6.4 years ago by vj430 • written 6.4 years ago by filipzembol110
gravatar for Alex Reynolds
6.4 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

Let's say you are working with hg19 chromosome extents (a sorted BED file called hg19.extents.bed):

To solve your problem, you can use bedops --chop to split this extents file by 60k-base increments, and you would then run a BEDOPS bedops set operation to capture all the elements that fall within each increment.

First, split the extents:

$ bedops --chop 60000 hg19.extents.bed > hg19.60k.bed

Second, sort your input file with BEDOPS sort-bed, to prep it for use with BEDOPS tools:

$ sort-bed input.bed > input.sorted.bed

Third, run the equivalent of bedops --element-of 1 input.sorted.bed increment_i.bed for each increment i.

Here is a bash shell command that will do this:

$ incCounter=0; \
while read incLine; \
do \
    incCounterPadded=`printf %07d ${incCounter}`; \
    chromosomeName=`cut -f1 ${incLine}`; \
    outputFn="output_${chromosomeName}_${incCounterPadded}.bed"; \
    echo -e ${incLine} | bedops --element-of 1 input.sorted.bed - > ${outputFn}; \
    incCounter=$((incCounter+1)); \
done < hg19.60k.bed

Each of the files output_chr1_0000000.bed, output_chr1_0000001.bed, etc. contains elements of input.sorted.bed that fall within 60k windows across hg19.

Note that this will create many, many thousands of files for hg19. You may want to do a bit more work to filter operations to folders named by chromosome, or widen your window region, or apply other strategies to more sensibly manage the output from this script. Hopefully this gets you started.

Sjneph is also correct that my method will cause "double-counting" where an input element spans two adjoining increments. This may or may not be an issue for your analysis, but his bedmap approach also yields much more manageable output while highlighting potentially problematic element overlaps. I'd give his answer more attention, depending on what you're trying to do.

ADD COMMENTlink modified 5.4 years ago • written 6.4 years ago by Alex Reynolds30k
gravatar for sjneph
6.4 years ago by
sjneph620 wrote:

Alex is spot on that you will receive a tremendous number of files that would be difficult to manage. Instead, you could consider demarcating every row with an indicator of what file it would be in with your method.

$ awk \
     '{ \
        regionChromosome = $1; \
         regionStart = $2; \
         regionStop = $3; \
         val = 1; \
          for ( baseStart = regionStart; baseStart < regionStop; baseStart += 60000 ) { \
            baseStop = baseStart + 60000; \
            print regionChromosome"\t"baseStart"\t"baseStop"\t"val++; \
          } \
      }' hg19.extents.bed \
    | bedmap --echo --echo-map-id --delim "\t" 1.bed - > answer.1.bed

That will produce outputs such as:

chr1  0        1000  1
chr1  1500     3000  1
chr1  50000   60000  1
chr1  60500   80000  2
chr1  90000  100000  2
chr1 120001  120500  3

you run into potential problems where an element overlaps 2 boundary regions. In that case, bedmap will produce 2 outputs for column 4, separated by a semicolon.

chr1    111199  120001  2;3

Either use it as is, or perhaps add --multidelim "\t" to the bedmap call and then run everything through cut -f1-4 to use the first of the two boundary values in such cases. There is a lot you can do with this simple technique that may get you out of a lot of hassle with further downstream operations, and it's all in one file.

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by sjneph620
gravatar for Kashyap Chhatbar
6.4 years ago by
Edinburgh, UK
Kashyap Chhatbar70 wrote:

I would try that in python

import csv
aWriter = csv.writer(open("1a.bed", "w"), delimiter="\t")
bWriter = csv.writer(open("1b.bed", "w"), delimiter="\t")
cWriter = csv.writer(open("1c.bed", "w"), delimiter="\t")
bedReader = csv.reader(open("test.bed"), delimiter="\t")
for row in bedReader:
      if int(row[1]) >= 0 and int(row[2]) <= 60000:
      elif int(row[1]) >= 60000 and int(row[2]) <= 120000:
      elif int(row[1]) >= 120000 and int(row[2]) <= 180000:

Hope it helps...

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by Kashyap Chhatbar70

Thank you, but i need it with step 60000 for all chromosome not only to 180000, but thanks. I am doing whole genome sequencing, I separate bed fie to each chromosomes and after that I need separate each chr1 to bed files by cooridnate condition with step 60000

ADD REPLYlink written 6.4 years ago by filipzembol110

Well, you can create a text file giving the ranges for what you want separated and read the ranges from the text file generated then use the logic to suite your needs.

ADD REPLYlink written 6.4 years ago by Kashyap Chhatbar70
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1945 users visited in the last hour