Dear all,
I would like to split a BED file into several files by coordinate: elements in coordinates 0-60000 go to the 1st BED file, 60000-120000 to the 2nd, 120000-180000 to the 3rd, and so on.
Let's say you are working with hg19 chromosome extents (a sorted BED file called hg19.extents.bed):
To solve your problem, you can use bedops --chop to split this extents file into 60k-base increments, and then run a bedops set operation to capture all the elements that fall within each increment.
Each of the files output_chr1_0000000.bed, output_chr1_0000001.bed, etc. contains elements of input.sorted.bed that fall within 60k windows across hg19.
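For anyone who wants to see the windowing logic spelled out, here is a minimal pure-Python sketch of the same idea. It is not what bedops does internally, just an illustration under some assumptions: the function name split_by_window is made up, the input is tab-delimited BED with 0-based, half-open coordinates, and the file names simply imitate the output_chr1_0000000.bed pattern above. It returns the groups in memory rather than writing files.

```python
from collections import defaultdict


def split_by_window(bed_lines, step=60000):
    """Group BED lines under a file name for every window they overlap.

    Hypothetical helper for illustration; window indices are 0-based
    and restart on each chromosome.
    """
    groups = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        first = start // step
        last = (end - 1) // step  # BED end coordinates are exclusive
        # An element spanning a boundary is placed in every window it touches.
        for idx in range(first, last + 1):
            groups[f"output_{chrom}_{idx:07d}.bed"].append(line)
    return dict(groups)
```

Note that an element crossing a window boundary lands in more than one group, so it would be written to more than one file.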
Note that this will create many, many thousands of files for hg19. You may want to do a bit more work to filter operations to folders named by chromosome, or widen your window region, or apply other strategies to more sensibly manage the output from this script. Hopefully this gets you started.
Sjneph is also correct that my method will cause "double-counting" where an input element spans two adjoining increments. This may or may not be an issue for your analysis, but his bedmap approach also yields much more manageable output while highlighting potentially problematic element overlaps. I'd give his answer more attention, depending on what you're trying to do.
Alex is spot on that you will receive a tremendous number of files that would be difficult to manage. Instead, you could consider demarcating every row with an indicator of what file it would be in with your method.
With this approach, you can run into problems where an element overlaps two boundary regions. In that case, bedmap will produce two values in column 4, separated by a semicolon:
chr1 111199 120001 2;3
Either use it as is, or perhaps add --multidelim "\t" to the bedmap call and then run everything through cut -f1-4 to use the first of the two boundary values in such cases. There is a lot you can do with this simple technique that may get you out of a lot of hassle with further downstream operations, and it's all in one file.
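To make the labelling idea concrete, here is a rough Python sketch of it. This is not the bedmap call itself, only an imitation of its output under some assumptions: label_windows is a hypothetical name, window IDs are 1-based to match the "2;3" example above, and the input is tab-delimited BED with at least three columns.

```python
def label_windows(bed_lines, step=60000, first_only=False):
    """Append a 4th column listing the 1-based window IDs each element overlaps.

    Hypothetical helper; set first_only=True to keep just the first ID,
    similar in spirit to keeping only the first of the two boundary values.
    """
    labelled = []
    for line in bed_lines:
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        first = int(start) // step + 1
        last = (int(end) - 1) // step + 1  # BED end coordinates are exclusive
        ids = [first] if first_only else range(first, last + 1)
        label = ";".join(str(i) for i in ids)
        labelled.append("\t".join([chrom, start, end, label]))
    return labelled
```

Everything stays in one list (or one file, if you write it out), and elements that straddle a boundary are easy to spot because their label contains a semicolon.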
Thank you, but I need it with a step of 60000 across the whole chromosome, not only up to 180000. I am doing whole-genome sequencing: I split the BED file by chromosome, and after that I need to split each chromosome (e.g. chr1) into BED files by coordinate, with a step of 60000.
Well, you can create a text file listing the ranges you want separated, read the ranges back from that file, and then adapt the logic to suit your needs.
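As a rough sketch of how such a ranges file could be generated for a whole chromosome (the function names and the idea of passing in the chromosome length are assumptions for illustration, not something from the posts above):

```python
def window_ranges(chrom, length, step=60000):
    """Yield (chrom, start, end) tuples covering [0, length) in fixed steps.

    Hypothetical helper; the last window is clipped to the chromosome length.
    """
    for start in range(0, length, step):
        yield chrom, start, min(start + step, length)


def ranges_as_bed(chrom, length, step=60000):
    """Format the ranges as tab-delimited BED text, one window per line."""
    return "".join(
        f"{c}\t{s}\t{e}\n" for c, s, e in window_ranges(chrom, length, step)
    )
```

You could write the result of ranges_as_bed to a text file (one per chromosome, using the real chromosome length) and loop over its lines to decide which output file each element belongs to.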