I would like to find short 12-mer DNA sequences that do not map to the human genome and also do not form self-mers. I want them to be as unique as possible. How can I get them? Which tool or software should I use?
I don't know what you mean by "not forming self-mer".
"Self-mer" in this context would mean homodimerisation.
The only two ways to really assess this would be to use primer design software (since I suspect this is a primer design question anyway) and to do some thermodynamic calculations. A brute-force way would be to simply check for 'palindromicity', i.e. whether a sequence is identical to its own reverse complement.
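To make the brute-force palindromicity check concrete, here is a minimal shell sketch. The function names revcomp and is_self_complementary are made up for illustration, and it assumes the sequences contain only A/C/G/T and that rev and tr are available. It only catches the strictest case (a sequence that is exactly its own reverse complement); partial self-complementarity would still need the thermodynamic tools mentioned above.

```shell
# Reverse-complement a DNA sequence (assumes only A/C/G/T, upper or lower case).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Exit status 0 when the sequence equals its own reverse complement,
# i.e. two copies of the oligo could pair perfectly with each other.
is_self_complementary() {
    [ "$(revcomp "$1")" = "$1" ]
}
```

Something like `is_self_complementary GAATTCGAATTC && echo "perfect palindrome"` could then be run over each candidate.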
A simple way would be to use https://eu.idtdna.com/calc/analyzer/home/batch but you're limited to 200 sequences at a time. Maybe check out Primer3 for a command-line tool.
Both comments are very useful. I generated all possible 12-mers in R:
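# Build all 4^12 = 16,777,216 possible 12-mers (one letter per column)
x = expand.grid(rep(list(c("A", "C", "G", "T")), 12))
# Join the columns into one string per row and write them sorted (C-locale order), one per line
writeLines(sort(do.call(paste0, x), method = "radix"), "12-mers_all_combinations.txt")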
I was thinking of "mapping" them to the reference genome with BLAT and picking the unmapped ones, but fragmenting the genome also sounds like a good idea.
Edit: I finished the analysis. I used jellyfish to enumerate every 12-mer present in the human genome (hg19), then used comm in bash to get the difference between the two lists:
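# jellyfish count writes its database to the file given with -o, not to stdout
jellyfish count -m 12 -s 3G -t 10 -o 12-mer_counts.jf hg19.fa
# dump prints FASTA (">count" header, then the k-mer); keep the k-mers and sort them for comm
jellyfish dump 12-mer_counts.jf | grep -v "^>" | LC_ALL=C sort > 12-mers_present_in_hg19.txt
# comm needs both inputs sorted the same way; -23 keeps lines found only in the first file
LC_ALL=C comm -23 12-mers_all_combinations.txt 12-mers_present_in_hg19.txt > 12-mers_absent_in_hg19.txt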
$ wc -l 12-mers_*
Now let the biologists dig through the 168199 candidates to pick the best primers 😄
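One caveat on the candidate list: jellyfish count without -C records k-mers only as they appear in the FASTA, i.e. on one strand. A candidate whose reverse complement occurs in hg19 is therefore still present on the opposite strand. Here is a hedged sketch of an extra filter for that. The function name absent_both_strands is made up for illustration, and it assumes both files hold one upper-case k-mer per line, that the absent list is sorted in C-locale order, and that rev is installed.

```shell
# Usage: absent_both_strands ABSENT_FILE PRESENT_FILE
# Prints the k-mers from ABSENT_FILE whose reverse complement is not in PRESENT_FILE.
absent_both_strands() {
    # Reverse-complement every present k-mer, sort the result,
    # then keep only the absent k-mers that do not appear in it.
    rev "$2" | tr 'ACGT' 'TGCA' | LC_ALL=C sort -u | LC_ALL=C comm -23 "$1" -
}
```

With the files from above this could be run as `absent_both_strands 12-mers_absent_in_hg19.txt 12-mers_present_in_hg19.txt > 12-mers_absent_both_strands.txt` (the output file name is just an example).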