Split Bed File To Another Bed Files With Specific Cordinates

1

Entering edit mode

11.4 years ago

filipzembol ▴ 180

Dear all, I would like to split bed file to more files. The conditions are coordinates (from cooridnates 0-60000 (1st bed file), 60000 - 120000 (2nd bed file), 120000 - 180000 (3rd bed file), etc...)

for example: INPUT 1.bed:

                    chr1        0      1000
                    chr1   1500      3000
                    chr1  50000    60000
                    chr1  60500    80000
                    chr1  90000   100000
                    chr1 120001  120500

OUTPUT

           1a.bed: chr1        0      1000
                      chr1   1500      3000
                      chr1  50000    60000

             1b.bed: chr1  60500    80000
                     chr1  90000   100000

              1c.bed: chr1  120001  120500

Thank you. Filip

bed awk split • 6.2k views

ADD COMMENT • link updated 11.4 years ago by vj ▴ 520 • written 11.4 years ago by filipzembol ▴ 180

2

Entering edit mode

11.4 years ago

Alex Reynolds 36k

Let's say you are working with hg19 chromosome extents (a sorted BED file called hg19.extents.bed):

	chr1 0 249250621
	chr10 0 135534747
	chr11 0 135006516
	chr12 0 133851895
	chr13 0 115169878
	chr14 0 107349540
	chr15 0 102531392
	chr16 0 90354753
	chr17 0 81195210
	chr18 0 78077248
	chr19 0 59128983
	chr2 0 243199373
	chr20 0 63025520
	chr21 0 48129895
	chr22 0 51304566
	chr3 0 198022430
	chr4 0 191154276
	chr5 0 180915260
	chr6 0 171115067
	chr7 0 159138663
	chr8 0 146364022
	chr9 0 141213431
	chrM 0 16571
	chrX 0 155270560
	chrY 0 59373566

view raw hg19.extents.bed hosted with ❤ by GitHub

To solve your problem, you can use bedops --chop to split this extents file by 60k-base increments, and you would then run a BEDOPS bedops set operation to capture all the elements that fall within each increment.

First, split the extents:

$ bedops --chop 60000 hg19.extents.bed > hg19.60k.bed

Second, sort your input file with BEDOPS sort-bed, to prep it for use with BEDOPS tools:

$ sort-bed input.bed > input.sorted.bed

Third, run the equivalent of bedops --element-of 1 input.sorted.bed increment_i.bed for each increment i.

Here is a bash shell command that will do this:

$ incCounter=0; \
while read incLine; \
do \
    incCounterPadded=`printf %07d ${incCounter}`; \
    chromosomeName=`cut -f1 ${incLine}`; \
    outputFn="output_${chromosomeName}_${incCounterPadded}.bed"; \
    echo -e ${incLine} | bedops --element-of 1 input.sorted.bed - > ${outputFn}; \
    incCounter=$((incCounter+1)); \
done < hg19.60k.bed

Each of the files output_chr1_0000000.bed, output_chr1_0000001.bed, etc. contains elements of input.sorted.bed that fall within 60k windows across hg19.

Note that this will create many, many thousands of files for hg19. You may want to do a bit more work to filter operations to folders named by chromosome, or widen your window region, or apply other strategies to more sensibly manage the output from this script. Hopefully this gets you started.

Sjneph is also correct that my method will cause "double-counting" where an input element spans two adjoining increments. This may or may not be an issue for your analysis, but his bedmap approach also yields much more manageable output while highlighting potentially problematic element overlaps. I'd give his answer more attention, depending on what you're trying to do.

ADD COMMENT • link 10.4 years ago by Alex Reynolds 36k

2

Entering edit mode

11.4 years ago

sjneph ▴ 690

Alex is spot on that you will receive a tremendous number of files that would be difficult to manage. Instead, you could consider demarcating every row with an indicator of what file it would be in with your method.


$ awk \
     '{ \
        regionChromosome = $1; \
         regionStart = $2; \
         regionStop = $3; \
         val = 1; \
          for ( baseStart = regionStart; baseStart < regionStop; baseStart += 60000 ) { \
            baseStop = baseStart + 60000; \
            print regionChromosome"\t"baseStart"\t"baseStop"\t"val++; \
          } \
      }' hg19.extents.bed \
    | bedmap --echo --echo-map-id --delim "\t" 1.bed - > answer.1.bed

That will produce outputs such as:

chr1  0        1000  1
chr1  1500     3000  1
chr1  50000   60000  1
chr1  60500   80000  2
chr1  90000  100000  2
chr1 120001  120500  3

you run into potential problems where an element overlaps 2 boundary regions. In that case, bedmap will produce 2 outputs for column 4, separated by a semicolon.


chr1    111199  120001  2;3

Either use it as is, or perhaps add --multidelim "\t" to the bedmap call and then run everything through cut -f1-4 to use the first of the two boundary values in such cases. There is a lot you can do with this simple technique that may get you out of a lot of hassle with further downstream operations, and it's all in one file.

ADD COMMENT • link 11.4 years ago by sjneph ▴ 690

0

Entering edit mode

11.4 years ago

Kashyap Chhatbar ▴ 70

I would try that in python

import csv
aWriter = csv.writer(open("1a.bed", "w"), delimiter="\t")
bWriter = csv.writer(open("1b.bed", "w"), delimiter="\t")
cWriter = csv.writer(open("1c.bed", "w"), delimiter="\t")
bedReader = csv.reader(open("test.bed"), delimiter="\t")
for row in bedReader:
      if int(row[1]) >= 0 and int(row[2]) <= 60000:
              aWriter.writerow(row)
      elif int(row[1]) >= 60000 and int(row[2]) <= 120000:
              bWriter.writerow(row)
      elif int(row[1]) >= 120000 and int(row[2]) <= 180000:
              cWriter.writerow(row)

Hope it helps...

ADD COMMENT • link 11.4 years ago by Kashyap Chhatbar ▴ 70

0

Entering edit mode

Thank you, but i need it with step 60000 for all chromosome not only to 180000, but thanks. I am doing whole genome sequencing, I separate bed fie to each chromosomes and after that I need separate each chr1 to bed files by cooridnate condition with step 60000

ADD REPLY • link 11.4 years ago by filipzembol ▴ 180

0

Entering edit mode

Well, you can create a text file giving the ranges for what you want separated and read the ranges from the text file generated then use the logic to suite your needs.

ADD REPLY • link 11.4 years ago by Kashyap Chhatbar ▴ 70

Login before adding your answer.