Question: Randomly sample motifs from reference sequence
gravatar for banerjeeshayantan
29 days ago by
banerjeeshayantan80 wrote:

I want to randomly sample 'n' number of sequence motifs of length 'k' from the reference sequence hg38.fa. Can someone help me out?

assembly genome • 232 views
ADD COMMENTlink modified 27 days ago by bernatgel1.7k • written 29 days ago by banerjeeshayantan80

I am reasonably certain that what you are looking for can be found in @Pierre and @Kevin's answers here: A: Finding 16 mer not present in GRCh38 Kevin's answer will work, if k < 17.

ADD REPLYlink modified 29 days ago • written 29 days ago by genomax62k
gravatar for jrj.healey
29 days ago by
United Kingdom
jrj.healey10k wrote:

This code does the job. It's not terrible at scale either.

#!/usr/bin/env python
import re, random

n = 20
k = 5
kmers = re.compile("(?=(\w{%s}))" % k)
selected = random.sample(kmers.findall(seq), n)    # Where 'seq' is the genome sequence as a string.

Change k (motif/kmer length) and n (random sample size) to suit.

Since posting this answer I've done some benchmarking, and this enumerated (n = ) 10 random (k = ) 9-mers from each entry in GCA_000001405.15_GRCh38_genomic.fna. Note that this should scale reasonably well for sampling larger n, but larger k will eventually become problematic


The total code execution time was: 00:22:03.37 (slightly over 22 minutes). See the gist below for the the results for each individual fasta in the multi .fna.

Link to gist (since it makes the post too long)

ADD COMMENTlink modified 28 days ago • written 29 days ago by jrj.healey10k

Thanks a lot for your reply. Your code randomly samples it from each chromosome. Is it possible to do this sampling from the entire genome, instead of individual chromosomes?

ADD REPLYlink written 28 days ago by banerjeeshayantan80

Well, it's essentially the same thing, since you get kmers from all sequences as a 'pool'. It might not be 100% fully statistically random, but someone with more intuition on that would have to weigh in. Certain kmers could be over or underrepresented in this approach I guess.

If you concatenated all of the fastas together and run it, it should still work, but there are 2 issues:

  1. You're reading an enormous file in to memory in one hit by that method.
  2. You'd create 'artificial' kmers, which correspond to the joins between chromosomes. (e.g the last 4 bases of chr1, and the first five of chr2 etc.). This might not be an issue if those kmers have already arisen normally, but thats not guaranteed.

I suppose the way to do it might be to read each chromosome, get its kmers, and add all of them to a 'master' pool of all kmers, and then randomly select from that. Storing those kmers will require a fair amount of memory though.

The concatenated approach would look something like:

$ cat GRCh38.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g' > GRCh38_concatenated.fasta

Then run the code as before on the newly created file.

Alternatively, iterate each chromosome and populate the pool along the lines of the following (I'm not going to run/test it as it will likely take too long):

#!/usr/bin/env python
import re, random
from Bio import SeqIO

n = 20
k = 5
kmers = re.compile("(?=(\w{%s}))" % k)
all_kmers = []
for record in SeqIO.parse('GCRh38.fasta', 'fasta'):

selected = random.sample(all_kmers, n)
ADD REPLYlink modified 28 days ago • written 28 days ago by jrj.healey10k

I tried your code in the last comment and it's taking an unusually long time. You were right! So I am sticking to the chromosome-wise sampling. Just a thought: What if I generate multiple random ranges of length 10 (say) and then use these to extract sequences. So I will generate something like 189-200, 7767655-7767666, 8987870-8987881 etc.

ADD REPLYlink written 27 days ago by banerjeeshayantan80

I’m not sure I fully understand what you mean but I don’t think you can just randomly choose ranges. There would be no guarantee of you coming anywhere close to sampling the full kmer space.

There is a reason people don’t tend to do these sorts of brute force kmer calculations (exactly this reason). You could probably parallelise the code and work on each sequence separately, which would speed it up enormously however.

ADD REPLYlink written 27 days ago by jrj.healey10k

Say there are 2899452029 base pairs in the reference genome. Then I can choose the starting position of the kmer uniformly from 1 to 2899452029 at random. This will continue until I get the required number of kmers.

ADD REPLYlink written 27 days ago by banerjeeshayantan80

Hmm yeah perhaps that would work. You'll have to decide what you do if it returns the same kmer in that case (and if you use a concatenated sequence for that, there is still the issue of 'artificial kmers').

ADD REPLYlink modified 27 days ago • written 27 days ago by jrj.healey10k
gravatar for bernatgel
27 days ago by
Barcelona, Spain
bernatgel1.7k wrote:

If I understand your question correctly, you want to randomly extract n sequences of length k from the hg38 genome. If this is the case, in R you can use regioneR package in Bioconductor. You can use the function createRandomRegions to randomly select a number of positions in the genome and then use getSeq from BSgenome to extract the sequences.

For example


regs <- createRandomRegions(nregions = 100, length.mean = 10, = 0, genome = "hg38", mask=NA)
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg38, regs)

Will extract 100 region 10bp each from the genome. By default createRandomRegions will avoid placing the regions in certain parts of the genome (highly repetitive, low complexity, stretches of N's...) using the default mask in the BSgenome object. Setting mask=NA will disable any mask and allow it to place the regions anywhere in the genome with uniform probability.

Is this what you were looking for?

ADD COMMENTlink written 27 days ago by bernatgel1.7k

Hi, This is a great tool. thanks a lot for suggesting. Can you also specify how not to get Ns in my results? How do I mask that? Removing mask=NA from the above code isn't helping.

ADD REPLYlink written 27 days ago by banerjeeshayantan80

Yes, true. To use the mask you need to load the masked version of the BSgenome package


regs <- createRandomRegions(nregions = 100, length.mean = 10, = 0, genome = "hg38")
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg38, regs)

With that you should not get the N's. However, take into account that the default mask do not have only the N regions but many regions identified by repeat masker, and others (refer to the BSgenome documentation for a detailed account of the masked regions). This will affect your results, since some motifs are heavily enriched in repeats and other masked elements. Also take into account that this problem will be present also with any other approach.

Maybe a better option would be to get more seqs than needed, remove the ones with Ns and then randomly sample the reduced list.

ADD REPLYlink written 26 days ago by bernatgel1.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2275 users visited in the last hour