Let's say I have a BED file that defines my genomic space as follows:
chrN a1 b1
chrN a2 b2
...
I would like to uniformly sample (with or without replacement) segments of length i from the genomic space that this BED file defines. (Assume that i is no longer than the shortest segment in the BED file.)
My first naive approach is to expand the BED input to:
chrN a1 a1+i
chrN a1+1 a1+i+1
chrN a1+2 a1+i+2
...
chrN b1-i b1
chrN a2 a2+i
...
chrN b2-i b2
...
Then I would sample from this new set by giving each line an ID (say, its line number) and sampling line numbers uniformly. Each sampled line number then maps back to its subrange element.
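To make the naive approach concrete, here is a minimal Python sketch of it. The function names `expand_windows` and `sample_windows` are made up for illustration, and the sketch assumes the BED intervals have already been parsed into `(chrom, start, end)` tuples with the usual 0-based, half-open coordinates.

```python
import random

def expand_windows(intervals, i):
    """Enumerate every length-i window inside each (chrom, start, end) interval."""
    windows = []
    for chrom, start, end in intervals:
        # Valid window starts run from start up to end - i, inclusive.
        for s in range(start, end - i + 1):
            windows.append((chrom, s, s + i))
    return windows

def sample_windows(windows, k, replace=True, seed=None):
    """Pick k windows uniformly by sampling list indices (the 'line numbers')."""
    rng = random.Random(seed)
    if replace:
        idx = [rng.randrange(len(windows)) for _ in range(k)]
    else:
        # random.sample raises ValueError if k exceeds the number of windows.
        idx = rng.sample(range(len(windows)), k)
    return [windows[j] for j in idx]

# Toy input standing in for a parsed BED file (hypothetical values).
intervals = [("chr1", 100, 110), ("chr1", 200, 205)]
windows = expand_windows(intervals, 5)
picked = sample_windows(windows, 3, replace=False, seed=0)
```

The `windows` list holds one entry per valid start position, so its size grows with the total number of bases covered by the BED file rather than with the number of intervals.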
Blowing up the BED input like this and setting up the map is a lot of work for large inputs. Is there a smarter/faster/less-memory-intensive way to approach this task?