It looks like a pretty straightforward scripting problem. You could efficiently resolve overlaps at the end with
bedops --partition, which, as far as I know, is not available in other toolkits.
Here's a script you can use to subsample from a shuffled list of intervals:
#!/usr/bin/env python

import sys
import random
import argparse

def main():
    parser = argparse.ArgumentParser(description='Subsample within intervals in BED file')
    parser.add_argument("-n", "--n", type=int, help='sample count')
    args = parser.parse_args()
    c = 0
    for line in sys.stdin:
        elems = line.strip().split('\t')
        start = int(elems[1])
        stop = int(elems[2])
        length = stop - start
        # leave room for a subinterval of at least length 1
        sample_start_offset = random.randint(0, length - 2)
        sample_length = random.randint(1, length - sample_start_offset - 1)
        elems[1] = str(start + sample_start_offset)
        elems[2] = str(start + sample_start_offset + sample_length)
        sys.stdout.write("%s\n" % ('\t'.join(elems)))
        c += 1
        if c == args.n:
            break

if __name__ == "__main__":
    main()
Here's how it could be used to sample 123 random subintervals:
$ subsample.py --n=123 < <(shuf intervals.bed) | sort-bed - | bedops --partition - | shuf -n 123 - | sort-bed - > answer.bed
How this works:
<(shuf intervals.bed) is a bash process substitution. You can use this to feed a shuffled list of intervals into
subsample.py (the above-listed Python script). You can use
shuf or any tool to shuffle lines of a file.
If your original intervals file is whole-genome scale and won't fit into the memory allotment for
shuf (i.e. you get out-of-memory errors using
shuf), I have a tool called sample that almost always gets around this limitation.
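The idea behind sampling lines without holding the whole file in memory can be sketched with classic reservoir sampling (a minimal illustration, not the actual sample tool; the function name is mine):

```python
import random

def reservoir_sample(lines, k):
    """Keep a uniform random sample of k lines using O(k) memory.

    Classic reservoir sampling: after the reservoir fills, each new
    line replaces a random slot with probability k / (lines seen).
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Because it streams, something like reservoir_sample(open("intervals.bed"), 123) never needs more memory than the 123 kept lines.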
The Python script
subsample.py parses each randomly-selected interval for its start and stop position, using this to generate a subinterval with a randomly selected start position and length, chosen such that the subinterval will fall entirely within the parent interval.
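The core sampling step can be sketched in isolation (a minimal sketch with hypothetical coordinates; the offset bound leaves room for a subinterval of at least length 1):

```python
import random

def subsample_interval(start, stop):
    """Pick a random subinterval lying entirely within [start, stop)."""
    length = stop - start
    # offset can be at most length - 2, so at least 1 base remains
    offset = random.randint(0, length - 2)
    sub_length = random.randint(1, length - offset - 1)
    return (start + offset, start + offset + sub_length)

sub_start, sub_stop = subsample_interval(1000, 2000)
assert 1000 <= sub_start < sub_stop <= 2000
```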
The output of this script is BED-formatted, but it is unsorted. We pipe this to
sort-bed to sort it, so that downstream tools like
bedops can work with it.
It is possible that subintervals overlap. So the sorted BED is piped to
bedops --partition to make all elements disjoint.
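Conceptually, partitioning splits the input at every start and stop boundary and keeps each covered piece (a rough Python sketch of the idea for intervals on one chromosome, not bedops itself):

```python
def partition(intervals):
    """Split (start, stop) pairs at every boundary so pieces are disjoint."""
    # collect every distinct boundary, then emit each gap between
    # consecutive boundaries that is covered by some input interval
    bounds = sorted({x for start, stop in intervals for x in (start, stop)})
    pieces = []
    for left, right in zip(bounds, bounds[1:]):
        if any(start <= left and right <= stop for start, stop in intervals):
            pieces.append((left, right))
    return pieces

print(partition([(10, 50), (30, 70)]))
# → [(10, 30), (30, 50), (50, 70)]
```

Two overlapping inputs become three disjoint pieces, which is why the element count can grow past N.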
If we split any subintervals, we'll end up with more subintervals than we originally asked for. So we pipe disjoint elements to
shuf -n to grab N random subintervals of interest, and pipe that again to
sort-bed to get sorted BED output. (We could use
head -n, but on reflection, I think that could introduce bias in the sample.)
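In Python terms, the difference between head -n and shuf -n looks roughly like this (hypothetical list of disjoint pieces; random.sample draws uniformly without replacement, while slicing always favors the earliest, lowest-coordinate elements):

```python
import random

pieces = [("chr1", 100, 200), ("chr1", 300, 400),
          ("chr2", 50, 150), ("chr2", 500, 600)]

biased = pieces[:2]                  # like head -n 2: always the first two
unbiased = random.sample(pieces, 2)  # like shuf -n 2: any two, equal chance
```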
It looks a bit involved, but I suspect this will run a lot faster, especially on large inputs and sample sizes.