You don't need to split the file.
Sort your BED file with BEDOPS sort-bed, if it is not already sorted:
$ sort-bed reads.unsorted.bed > reads.bed
Then pass the chromosome name with --chrom <chromosome> to do work only on that chromosome:
$ bedops --chrom chrN --operations ...
$ bedmap --chrom chrN --operations ...
If you want a list of all operations, take a look at the BEDOPS documentation.
If you want a fast list of chromosomes:
$ bedextract --list-chr reads.bed
You can pipe this list to a script loop, to do work over each chromosome on a computational cluster. BEDOPS enables parallelization by chromosome pretty easily.
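Here is a minimal sketch of such a loop, assuming a sort-bed sorted input. The function name per_chrom_merge, the output file naming, and the choice of --merge as the operation are only illustrative; swap in whatever operation you actually need.

```shell
#!/usr/bin/env bash
# Sketch: run one bedops operation per chromosome, writing one output file each.
# per_chrom_merge is a hypothetical helper name; --merge is just an example operation.
per_chrom_merge() {
    local bed="$1" outdir="$2"
    mkdir -p "$outdir"
    # bedextract --list-chr prints one chromosome name per line
    bedextract --list-chr "$bed" | while read -r chrom; do
        # --chrom restricts the work to a single chromosome
        bedops --chrom "$chrom" --merge "$bed" > "$outdir/$chrom.merged.bed"
    done
}
```

Call it as per_chrom_merge reads.bed merged_out. On a cluster, you would replace the bedops line in the loop body with a job submission for your scheduler, so that each chromosome runs as its own job.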
These operations are very fast because they use the sort order of sort-bed sorted BED files to do a binary search that determines chromosome bounds. Operations that take minutes or hours can go down to seconds.
If your reads are single-end, you should be able to add --faster to BEDOPS commands to make them use similar techniques to speed operations up even further.
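As a rough illustration, here is how --faster might be combined with --chrom in a bedmap call. The helper name count_reads_chrom, the file arguments, and the --echo --count operations are assumptions for the example; check the bedmap documentation for the exact conditions under which --faster is valid for your inputs.

```shell
# Sketch: count reads overlapping each reference element on one chromosome.
# count_reads_chrom is a hypothetical helper name; the file arguments are placeholders.
count_reads_chrom() {
    local chrom="$1" ref="$2" reads="$3"
    # --faster assumes the inputs have no fully nested elements
    bedmap --faster --chrom "$chrom" --echo --count "$ref" "$reads"
}
```

For example, count_reads_chrom chrN regions.bed reads.bed would print each element of regions.bed on chrN followed by the number of overlapping reads.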
Paired-end reads can result in "nested" elements that prevent the use of --faster without some pre-processing tricks. Your data is on the multi-GB scale, so if you are working with paired-end reads and doing these operations repeatedly, these tricks may be worth investigating. See the "extra work" note in the