I was also interested in this problem and I tried to find a solution in R:
We can take advantage of the
regioneR package and its
createRandomRegions function: it lets you sample regions and you control region size, variation in size, and overlapping output regions or not. The function allows you to specify a genome and also mask any region from which you do not want to sample regions.
The easiest solution would be to input your genes as the background genome from which to sample the regions, but the genome input can only have one region per chromosome. Thus, the final solution is to input the genome and mask all regions which are not genes. For example:
# Create GRanges for the whole genome and the genes in the genome
# Genes, or any BED of regions you like
txdb.genes <- genes(txdb)
txdb.genes <- reduce(txdb.genes, ignore.strand = T)
genome <- GRanges(
ranges=IRanges(start=rep(1, length(BSgenome.Hsapiens.UCSC.hg38)), end=seqlengths(BSgenome.Hsapiens.UCSC.hg38)),
# Mask anything in the genome which are not genes
mask <- setdiff(genome, txdb.genes, ignore.strand = T)
table(countOverlaps(txdb.genes, mask)) # check no genes are in the mask
# Create random regions
regions_in_genes <- createRandomRegions(nregions=1000, length.mean=1000, length.sd=0, genome = genome, mask = mask, non.overlapping = T)
table(countOverlaps(regions_in_genes, txdb.genes)) # Check that all your regions overlap with genes
table(countOverlaps(regions_in_genes, txdb.genes, type = "within")) # Check that all your regions are WITHIN genes
I have found a sufficient solution:
This bash script takes a 1bp "peak" every 100 bp inside all genes and prints them to a new file. One can then take ranom positions from this File and enlarge the peaks with bedtools slop.
Thank you for your time!
A couple thoughts:
Your sampling approach will favor longer genes. Samples will contain more 1bp "pseudo"-peaks from longer genes than shorter genes, which can introduce unwanted bias. This may or may not present a problem for your analysis, but you should be aware of this bias when calling the result "random".
RANDOMis a reserved variable in
bash, which you may or may not be able to override. It is recommended to use lowercase names for variables in
bash, so that you don't end up using reserved POSIX variable names and get unintended (garbage) results.
I wrote a
bashscript here that will draw a uniform sample with replacement of
Nintervals of length
There is a short demo of how to use it at the bottom of the page, which samples ten 10kb intervals and writes them as sorted output, ready for further set operations:
The chromosome names and sizes can be replaced if you are using a different genome assembly.
This script uses the rejection sampling approach I mentioned in my answer below. This approach takes into account the length of chromosomes over
hg38when drawing random integers, by rejecting any integers that are longer than the genome, as well as rejecting those which create intervals that extend over the end bounds of a chromosome.
Once you have samples, it is a simple matter to filter them, by way of piping to
Repeat until enough samples are collected, as needed.