Question

Split Bed File To Another Bed Files With Specific Cordinates

1

Entering edit mode

10.2 years ago

filipzembol ▴ 170

Dear all, I would like to split bed file to more files. The conditions are coordinates (from cooridnates 0-60000 (1st bed file), 60000 - 120000 (2nd bed file), 120000 - 180000 (3rd bed file), etc...)

for example: INPUT 1.bed:

                    chr1        0      1000
                    chr1   1500      3000
                    chr1  50000    60000
                    chr1  60500    80000
                    chr1  90000   100000
                    chr1 120001  120500

OUTPUT

           1a.bed: chr1        0      1000
                      chr1   1500      3000
                      chr1  50000    60000

             1b.bed: chr1  60500    80000
                     chr1  90000   100000

              1c.bed: chr1  120001  120500

Thank you. Filip

bed awk split • 5.5k views

ADD COMMENT • link updated 10.2 years ago by vj ▴ 520 • written 10.2 years ago by filipzembol ▴ 170

score 2 · Answer 1 · 2014-02-19

Let's say you are working with hg19 chromosome extents (a sorted BED file called hg19.extents.bed):

To solve your problem, you can use bedops --chop to split this extents file by 60k-base increments, and you would then run a BEDOPS bedops set operation to capture all the elements that fall within each increment.

First, split the extents:

$ bedops --chop 60000 hg19.extents.bed > hg19.60k.bed

Second, sort your input file with BEDOPS sort-bed, to prep it for use with BEDOPS tools:

$ sort-bed input.bed > input.sorted.bed

Third, run the equivalent of bedops --element-of 1 input.sorted.bed increment_i.bed for each increment i.

Here is a bash shell command that will do this:

$ incCounter=0; \
while read incLine; \
do \
    incCounterPadded=`printf %07d ${incCounter}`; \
    chromosomeName=`cut -f1 ${incLine}`; \
    outputFn="output_${chromosomeName}_${incCounterPadded}.bed"; \
    echo -e ${incLine} | bedops --element-of 1 input.sorted.bed - > ${outputFn}; \
    incCounter=$((incCounter+1)); \
done < hg19.60k.bed

Each of the files output_chr1_0000000.bed, output_chr1_0000001.bed, etc. contains elements of input.sorted.bed that fall within 60k windows across hg19.

Note that this will create many, many thousands of files for hg19. You may want to do a bit more work to filter operations to folders named by chromosome, or widen your window region, or apply other strategies to more sensibly manage the output from this script. Hopefully this gets you started.

Sjneph is also correct that my method will cause "double-counting" where an input element spans two adjoining increments. This may or may not be an issue for your analysis, but his bedmap approach also yields much more manageable output while highlighting potentially problematic element overlaps. I'd give his answer more attention, depending on what you're trying to do.

score 2 · Answer 2 · 2014-02-19

Alex is spot on that you will receive a tremendous number of files that would be difficult to manage. Instead, you could consider demarcating every row with an indicator of what file it would be in with your method.


$ awk \
     '{ \
        regionChromosome = $1; \
         regionStart = $2; \
         regionStop = $3; \
         val = 1; \
          for ( baseStart = regionStart; baseStart < regionStop; baseStart += 60000 ) { \
            baseStop = baseStart + 60000; \
            print regionChromosome"\t"baseStart"\t"baseStop"\t"val++; \
          } \
      }' hg19.extents.bed \
    | bedmap --echo --echo-map-id --delim "\t" 1.bed - > answer.1.bed

That will produce outputs such as:

chr1  0        1000  1
chr1  1500     3000  1
chr1  50000   60000  1
chr1  60500   80000  2
chr1  90000  100000  2
chr1 120001  120500  3

you run into potential problems where an element overlaps 2 boundary regions. In that case, bedmap will produce 2 outputs for column 4, separated by a semicolon.


chr1    111199  120001  2;3

Either use it as is, or perhaps add --multidelim "\t" to the bedmap call and then run everything through cut -f1-4 to use the first of the two boundary values in such cases. There is a lot you can do with this simple technique that may get you out of a lot of hassle with further downstream operations, and it's all in one file.

score 0 · Answer 3 · 2014-02-19

0

Entering edit mode

10.2 years ago

Kashyap Chhatbar ▴ 70

I would try that in python

import csv
aWriter = csv.writer(open("1a.bed", "w"), delimiter="\t")
bWriter = csv.writer(open("1b.bed", "w"), delimiter="\t")
cWriter = csv.writer(open("1c.bed", "w"), delimiter="\t")
bedReader = csv.reader(open("test.bed"), delimiter="\t")
for row in bedReader:
      if int(row[1]) >= 0 and int(row[2]) <= 60000:
              aWriter.writerow(row)
      elif int(row[1]) >= 60000 and int(row[2]) <= 120000:
              bWriter.writerow(row)
      elif int(row[1]) >= 120000 and int(row[2]) <= 180000:
              cWriter.writerow(row)

Hope it helps...

ADD COMMENT • link 10.2 years ago by Kashyap Chhatbar ▴ 70

0

Entering edit mode

Thank you, but i need it with step 60000 for all chromosome not only to 180000, but thanks. I am doing whole genome sequencing, I separate bed fie to each chromosomes and after that I need separate each chr1 to bed files by cooridnate condition with step 60000

ADD REPLY • link 10.2 years ago by filipzembol ▴ 170

0

Entering edit mode

Well, you can create a text file giving the ranges for what you want separated and read the ranges from the text file generated then use the logic to suite your needs.

ADD REPLY • link 10.2 years ago by Kashyap Chhatbar ▴ 70