Let's say I have a BED file that defines my genomic space as follows:
chrN a1 b1
chrN a2 b2
...
I would like to uniformly sample (with or without replacement) segments of length i from the genomic space that this BED file defines. (Assume that i is no longer than the shortest segment in the BED file.)
My first naive approach is to expand the BED input to:
chrN a1 a1 + i
chrN a1 + 1 a1 + i + 1
chrN a1 + 2 a1 + i + 2
...
chrN b1 - i b1
chrN a2 a2 + i
...
chrN b2 - i b2
...
Then I would sample from this new set by mapping each line to an ID (say, its line number) and uniformly sampling those IDs. Once I have an ID, I look up the corresponding subrange.
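For concreteness, here is a minimal Python sketch of this naive approach. The function names (`expand_windows`, `sample_windows`) and the toy coordinates are made up for illustration. It assumes BED-style half-open intervals and that every interval is at least i long.

```python
import random

def expand_windows(intervals, i):
    """Enumerate every length-i window inside each (chrom, start, end) interval."""
    windows = []
    for chrom, start, end in intervals:
        # Window starts run from start up to and including end - i.
        for s in range(start, end - i + 1):
            windows.append((chrom, s, s + i))
    return windows

def sample_windows(windows, k, replace=True, rng=random):
    """Uniformly sample k windows by drawing line numbers (list indices)."""
    if replace:
        ids = [rng.randrange(len(windows)) for _ in range(k)]
    else:
        ids = rng.sample(range(len(windows)), k)
    # Map each sampled ID back to its subrange.
    return [windows[j] for j in ids]

# Toy input (hypothetical coordinates), with i = 3.
intervals = [("chrN", 10, 15), ("chrN", 100, 104)]
windows = expand_windows(intervals, 3)
picked = sample_windows(windows, 4, replace=False)
```

The `windows` list holds one entry per possible start position, so its size grows with the total number of bases covered by the BED file rather than with the number of BED lines.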
Blowing up the BED input like this and setting up the map is a lot of work for large inputs. Is there a smarter/faster/less-memory-intensive way to approach this task?