I wrote a tool called sample (GitHub site) that uses reservoir sampling to pull an ordered or randomly shuffled, uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.
When sampling without replacement, this application offers a few advantages over common alternatives such as shuf. The sample tool stores a pool of line positions and makes two passes through the input file. The first pass generates the sample of random positions, while the second pass uses those positions to print the sample to standard output. To minimize the expense of the second pass, we use mmap routines to gain random access to data in the (regular) input file on both passes.
The benefit that mmap provided was significant. For comparison purposes, we also added a --cstdio option to test the performance of standard C I/O routines (fseek(), etc.); predictably, this performed worse than the mmap-based approach in all tests, but its timing results were about identical to gshuf on OS X and still an average 1.5x improvement over shuf under Linux.
The sample tool can be used to sample from any text file delimited by single newline characters (BED, SAM, VCF, etc.). Also, the --lines-per-offset option allows sampling and shuffling repeated multiples of newline-delimited lines, which is useful for sampling from (for example) FASTQ files, where each record is split across four lines.
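To illustrate what sampling by a lines-per-offset multiple means, here is a minimal Python sketch that groups the input into fixed-size blocks of lines and samples whole blocks, so that a multi-line record (like a four-line FASTQ entry) is never split. The function name and in-memory list input are assumptions for demonstration; the real tool works on byte offsets within the file.

```python
import random

def sample_records(lines, k, lines_per_record=4, rng=random.Random()):
    """Reservoir-sample k multi-line records (e.g. 4-line FASTQ entries).

    Illustrative sketch: whole records, not individual lines, are the
    sampling unit, mirroring the idea behind a lines-per-offset option.
    """
    reservoir = []
    n = 0
    record = []
    for line in lines:
        record.append(line)
        if len(record) == lines_per_record:
            if n < k:
                reservoir.append(record)
            else:
                j = rng.randrange(n + 1)
                if j < k:
                    reservoir[j] = record
            n += 1
            record = []  # start the next record
    return reservoir
```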
By adding the --preserve-order option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS sort-bed - which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates - the sample will also have the same ordering applied, with a relatively small O(k log k) penalty for a sample of size k.
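The O(k log k) cost comes from sorting only the k sampled positions, not the whole file. A minimal Python sketch of that idea (the function name is an assumption for illustration):

```python
import random

def sample_indices_preserving_order(n, k, rng=random.Random()):
    """Pick k of n line indices uniformly at random, then sort them so
    the emitted sample follows the input file's ordering.

    The final sort touches only the k sampled indices, so its cost is
    O(k log k), independent of the total line count n.
    """
    reservoir = []
    for i in range(n):
        if i < k:
            reservoir.append(i)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = i
    return sorted(reservoir)
```

Emitting lines at these sorted indices reproduces the input order, whatever sort (such as sort-bed's) was applied upstream.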
By omitting the sample size parameter, the sample tool will shuffle the entire file. This tool can be used to shuffle files that shuf has trouble with; however, in this use case it currently runs more slowly than shuf (in cases where shuf can still be used). We recommend using shuf to shuffle an entire file, or specifying the sample size (up to the line count, if known ahead of time), when possible.
One downside at this time is that sample does not process a standard input stream; the input must be a regular file.