Find 12-mer not mapping to the human genome
4.8 years ago
Ömer An ▴ 260


I would like to find short 12-mer DNA sequences that do not map to the human genome and also do not form self-mer. I want them to be as unique as possible. How can I get them? Which tool/software should I use?

4.8 years ago
5heikki 11k
  1. Fragment the human genome into 12-mers with e.g. jellyfish
  2. Generate all possible 12-mers with e.g. this
  3. Use comm to get the lines unique to your all-12-mers file (comm expects both files to be sorted)
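For anyone who wants to see the logic of these three steps on a small scale, here is a minimal Python sketch. It is not the jellyfish/comm pipeline itself: the toy sequence, the value of k and the function names are all made up for illustration, and holding every 12-mer of a real genome in Python sets would be slow and memory-hungry, which is why a dedicated k-mer counter is suggested above.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Return the set of every possible k-mer over the alphabet (4**k strings)."""
    return {"".join(p) for p in product(alphabet, repeat=k)}


def kmers_in_sequence(seq, k):
    """Return the set of k-mers found in seq, skipping windows with non-ACGT letters (e.g. N)."""
    seq = seq.upper()
    valid = set("ACGT")
    found = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= valid:
            found.add(window)
    return found


# Hypothetical toy "genome" and a small k so the example runs instantly
toy_genome = "ACGTACGTNNACGGTTCA"
k = 3

universe = all_kmers(k)                       # step 2: all possible k-mers
present = kmers_in_sequence(toy_genome, k)    # step 1: k-mers seen in the genome
absent = sorted(universe - present)           # step 3: what comm -23 would report

print(len(universe), len(present), len(absent))
```

The set difference plays the role of `comm -23`: it keeps only the k-mers that appear in the "all possible" collection and not in the "present" collection.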

I don't know what you mean by "not forming self-mer"..

"Self-mer" in this context would mean homodimerisation.

The only two ways to really assess this would be to use primer design software (since I suspect this is a primer design question anyway) or to do some thermodynamic calculations. A brute-force way would be to simply check for 'palindromicity'.
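As a rough illustration of the brute-force 'palindromicity' check, here is a minimal Python sketch. It only asks whether a sequence, or a stretch inside it, equals its own reverse complement. It does no thermodynamics, so treat it as a crude pre-filter and not a replacement for primer design software. The function names and example sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ACGT sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_self_complementary(seq):
    """True if the whole sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def longest_palindromic_stretch(seq):
    """Length of the longest substring that equals its own reverse complement.

    Only even lengths are tested, because an odd-length DNA sequence can
    never be its own reverse complement (the middle base would have to
    pair with itself).
    """
    seq = seq.upper()
    best = 0
    n = len(seq)
    for start in range(n):
        for end in range(start + 2, n + 1, 2):
            sub = seq[start:end]
            if sub == reverse_complement(sub):
                best = max(best, end - start)
    return best


# Hypothetical example sequences
for candidate in ["ACGTACGTACGT", "AAGAATTCAAAA", "AAAAAAAAAAAA"]:
    print(candidate, is_self_complementary(candidate),
          longest_palindromic_stretch(candidate))
```

A candidate list could then be thinned by dropping anything whose longest palindromic stretch reaches some cutoff. Where to put that cutoff is a judgment call that this sketch does not try to answer.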

Simple way would be to use but you're limited to 200 sequences at a time. Maybe check out Primer3 for a commandline tool.

Both comments are very useful. I generated all possible 12-mers in R:

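# One 12-mer per line (4^12 = 16,777,216 sequences), sorted in C-locale order for use with comm
x = expand.grid(rep(list(c("A", "C", "G", "T")), 12), stringsAsFactors = FALSE)
kmers = do.call(paste0, x)
writeLines(sort(kmers, method = "radix"), "12-mers_all_combinations.txt")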

I was thinking of "mapping" them to the reference genome with BLAT and picking the unmapped ones, but fragmenting the genome also sounds like a good idea.

Edit: I finished the analysis. I used jellyfish to list every 12-mer present in the human genome (hg19), then used comm in bash to get the 12-mers from the full set that are absent from the genome:

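# jellyfish writes its database to the file given with -o, not to stdout
jellyfish count -m 12 -s 3G -t 10 -o 12-mer_counts.jf hg19.fa
# comm needs both inputs sorted in the same locale, so sort everything in the C locale
jellyfish dump 12-mer_counts.jf | grep -v "^>" | LC_ALL=C sort > 12-mers_present_in_hg19.txt
LC_ALL=C comm -23 <(LC_ALL=C sort 12-mers_all_combinations.txt) 12-mers_present_in_hg19.txt > 12-mers_absent_in_hg19.txt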

$ wc -l 12-mers_*
 16777216 12-mers_all_combinations.txt
 16609017 12-mers_present_in_hg19.txt
   168199 12-mers_absent_in_hg19.txt
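One caveat that may be worth checking: as far as I understand, jellyfish count without -C counts k-mers only as they are written in the FASTA file, that is, on one strand. A 12-mer missing from that list could still match the opposite strand if its reverse complement is present. Below is a minimal Python sketch of a post-filter that keeps only candidates whose reverse complement is also absent. The function names and the toy input are hypothetical, and I have not run this on the real file.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def absent_on_both_strands(absent_kmers):
    """Keep k-mers whose reverse complement is also in the absent set.

    If a k-mer is absent from the forward strand but its reverse
    complement is present there, the k-mer itself occurs on the
    reverse strand, so it is dropped.
    """
    absent = set(absent_kmers)
    return sorted(kmer for kmer in absent if reverse_complement(kmer) in absent)


# Hypothetical toy input; for real data, read one k-mer per line from
# the file of absent 12-mers instead.
toy_absent = ["AAAC", "GTTT", "CCGA", "ACGT"]
print(absent_on_both_strands(toy_absent))
```

In the toy input, CCGA is dropped because its reverse complement TCGG is not in the absent list, while ACGT survives because it is its own reverse complement.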

Now let the biologist dig into 168199 candidates to pick the best primers 😄


