Question: Find 12-mer not mapping to the human genome
Asked 17 months ago by Ömer An:


I would like to find short 12-mer DNA sequences that do not map to the human genome and do not form self-dimers ("self-mers") either. I want them to be as unique as possible. How can I get them? Which tool/software should I use?

Answer by 5heikki, 17 months ago:
  1. Fragment the human genome into 12-mers with e.g. jellyfish
  2. Generate all possible 12-mers with e.g. this
  3. Use comm (on sorted files) to get the lines unique to your file of all 12-mers
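The three steps above can be tried on a toy scale before running them on a genome. This is only a sketch, assuming k = 2 instead of 12 and a made-up list of "observed" k-mers standing in for the jellyfish output; the file names are hypothetical. The point it illustrates is that comm expects both inputs sorted the same way.

```shell
#!/bin/sh
# Toy version of the three steps with k = 2 instead of 12.
# File names and the "genome" k-mer list are made up for illustration.
set -eu
export LC_ALL=C   # byte-wise collation so sort and comm agree

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Step 2 stand-in: every possible 2-mer, in sorted order.
for a in A C G T; do
  for b in A C G T; do
    printf '%s%s\n' "$a" "$b"
  done
done | sort > "$workdir/all.txt"

# Step 1 stand-in: k-mers "observed" in a pretend genome (unsorted, with a duplicate).
printf '%s\n' TT AC GA AC CA | sort -u > "$workdir/present.txt"

# Step 3: comm -23 keeps the lines found only in the first file.
comm -23 "$workdir/all.txt" "$workdir/present.txt" > "$workdir/absent.txt"

# Number of 2-mers never observed.
wc -l < "$workdir/absent.txt"
```

With four distinct observed 2-mers, this leaves 12 of the 16 possible 2-mers in the absent list.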

I don't know what you mean by "not forming self-mer".


"Self-mer" in this context would mean homodimerisation.

There are really only two ways to assess this. One is to use primer design software (since I suspect this is a primer design question anyway) and do some thermodynamic calculations. The other, brute-force way is to simply check for 'palindromicity', i.e. whether a sequence equals its own reverse complement.

A simple way would be to use (link missing), but you're limited to 200 sequences at a time. Maybe check out Primer3 for a command-line tool.
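The brute-force palindromicity check could look something like the sketch below. It only drops sequences that are exactly equal to their own reverse complement; partial self-complementarity (for example at the 3' end) would still need a thermodynamic tool. The function name and the three test sequences are made up for illustration, and the input is assumed to be one uppercase A/C/G/T sequence per line.

```shell
#!/bin/sh
# Print only sequences that are NOT equal to their own reverse complement.
# Function name and input are illustrative, not from any primer-design tool.
drop_self_complementary() {
  awk '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    {
      rc = ""
      # Walk the sequence from its last base to its first, complementing each base.
      for (i = length($0); i >= 1; i--) rc = rc comp[substr($0, i, 1)]
      if (rc != $0) print
    }
  '
}

# ACGTACGTACGT and GAATTCGAATTC are perfect palindromes; AAAAAACCCCCC is not.
printf '%s\n' ACGTACGTACGT AAAAAACCCCCC GAATTCGAATTC | drop_self_complementary
```

On these three inputs only AAAAAACCCCCC is printed. On a real candidate list the function would be used as a filter, reading the candidates on standard input.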

(Reply written 17 months ago by Joe)

Both comments are very useful. I generated all possible 12-mers in R:

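# Build all 4^12 12-mers, one per line, sorted so that comm can use the file
x <- expand.grid(rep(list(c("A", "C", "G", "T")), 12), stringsAsFactors = FALSE)
kmers <- sort(do.call(paste0, x))
writeLines(kmers, "12-mers_all_combinations.txt")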

I was thinking of "mapping" them to the reference genome with BLAT and picking the unmapped ones, but fragmenting the genome also sounds like a good idea.

Edit: I finished the analysis. I used jellyfish to list all 12-mers present in the human genome (hg19), then used comm to get the difference between the two lists:

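# Count all 12-mers in hg19; jellyfish writes its database to the file given with -o
jellyfish count -m 12 -s 3G -t 10 -o 12-mer_counts.jf hg19.fa
# Dump as FASTA, drop the count header lines, and sort (comm needs sorted input)
jellyfish dump 12-mer_counts.jf | grep -v "^>" | sort > 12-mers_present_in_hg19.txt
# Keep the lines that occur only in the all-combinations file (which must also be sorted)
comm -23 12-mers_all_combinations.txt 12-mers_present_in_hg19.txt > 12-mers_absent_in_hg19.txt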

$ wc -l 12-mers_*
 16777216 12-mers_all_combinations.txt
 16609017 12-mers_present_in_hg19.txt
   168199 12-mers_absent_in_hg19.txt

Now let the biologist dig into 168199 candidates to pick the best primers 😄
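One possible refinement, offered as an assumption rather than as part of the analysis above: as far as I know, `jellyfish count` without `-C` counts k-mers only as they appear in the input FASTA, so a 12-mer in the absent list may still have its reverse complement present in hg19. A stricter candidate list would keep only the 12-mers whose reverse complement is also absent. The function name below is hypothetical, and the demo uses made-up 3-mers.

```shell
#!/bin/sh
# Keep k-mers whose reverse complement is also in the absent list.
# Assumes one uppercase A/C/G/T k-mer per line; the function name is hypothetical.
absent_on_both_strands() {
  awk '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    function revcomp(s,    i, rc) {
      rc = ""
      for (i = length(s); i >= 1; i--) rc = rc comp[substr(s, i, 1)]
      return rc
    }
    # First pass over the file: remember every absent k-mer.
    NR == FNR { absent[$0] = 1; next }
    # Second pass: print a k-mer only if its reverse complement is absent too.
    (revcomp($0) in absent) { print }
  ' "$1" "$1"
}

# Toy demo: AAC and GTT are reverse complements of each other; ACC has no partner.
demo=$(mktemp)
printf '%s\n' AAC GTT ACC > "$demo"
absent_on_both_strands "$demo"
rm -f "$demo"
```

The demo prints AAC and GTT and drops ACC. On the real data the call would be something like `absent_on_both_strands 12-mers_absent_in_hg19.txt`, with the output redirected to a new file.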

(Reply written 17 months ago by Ömer An, modified 4 months ago)