Question: Find 12-mer not mapping to the human genome
bounlu (Singapore) wrote 7 months ago:

Hi,

I would like to find short 12-mer DNA sequences that do not map to the human genome and also do not form self-mers. I want them to be as unique as possible. How can I get them? Which tool or software should I use?

5heikki (Finland) wrote 7 months ago:
  1. Fragment the human genome into 12-mers with e.g. jellyfish
  2. Generate all possible 12-mers with e.g. this
  3. Use comm to get the lines unique to your all-12-mers file (comm expects both inputs to be sorted)

I don't know what you mean by "not forming self-mer".
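
The three steps above can be sketched end to end on a toy input. This is only an illustration under stated assumptions: k=3 and the short sequence are made up, the function names are hypothetical, and plain awk stands in for jellyfish, which is what you would actually use on a whole genome.

```shell
# Toy run of the three steps. k=3 and the short sequence are made-up stand-ins
# (assumptions for illustration); the thread itself uses k=12 and hg19.
k=3
genome="ACGTACGGTTAACC"

# Step 1: list every k-mer that occurs in the sequence (forward strand only).
present_kmers() {  # usage: present_kmers SEQUENCE K
  printf '%s\n' "$1" | awk -v k="$2" '{
    for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
  }' | sort -u
}

# Step 2: enumerate all 4^k k-mers by counting in base 4.
all_kmers() {  # usage: all_kmers K
  awk -v k="$1" 'BEGIN {
    split("A C G T", base, " ")
    total = 4 ^ k
    for (i = 0; i < total; i++) {
      kmer = ""; rest = i
      for (j = 0; j < k; j++) {
        kmer = base[rest % 4 + 1] kmer
        rest = int(rest / 4)
      }
      print kmer
    }
  }' | sort
}

# Step 3: comm -23 prints lines that appear only in the first sorted input.
absent_kmers() {  # usage: absent_kmers SEQUENCE K
  all_file=$(mktemp)
  all_kmers "$2" > "$all_file"
  present_kmers "$1" "$2" | comm -23 "$all_file" -
  rm -f "$all_file"
}

absent_kmers "$genome" "$k" | wc -l
```

The toy sequence contains 11 distinct 3-mers, so 53 of the 64 possible 3-mers are reported as absent. Both inputs are sorted before comm sees them, which is the part that is easiest to forget.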


"Self-mer" in this context would mean homodimerisation.

The only two ways to really assess this would be to use primer design software (since I suspect this is a primer design question anyway) or to do some thermodynamic calculations yourself. A brute-force way would be to simply check for 'palindromicity', i.e. whether a sequence equals its own reverse complement.

A simple way would be to use https://eu.idtdna.com/calc/analyzer/home/batch, but you're limited to 200 sequences at a time. Maybe check out Primer3 for a command-line tool.
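
For the brute-force 'palindromicity' check, a minimal sketch might look like the following. The function names are hypothetical, and the check only catches fully self-complementary sequences; partial complementarity (e.g. at the 3' end) can still give homodimers, so this is a first-pass filter and not a replacement for the thermodynamic tools.

```shell
# Reverse complement of an uppercase DNA string (A/C/G/T only).
revcomp() {
  printf '%s\n' "$1" | awk '{
    out = ""
    for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
    print out
  }' | tr 'ACGT' 'TGCA'
}

# Succeeds (exit status 0) when the sequence equals its own reverse complement.
is_self_complementary() {
  [ "$1" = "$(revcomp "$1")" ]
}

# Two made-up 12-mers: the first is a perfect palindrome, the second is not.
for seq in GAATTCGAATTC AAAACCCCGGGG; do
  if is_self_complementary "$seq"; then
    echo "$seq self-complementary"
  else
    echo "$seq ok"
  fi
done
```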

written 7 months ago by jrj.healey

Both comments are very useful. I generated all possible 12-mers in R:

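# Build all 4^12 combinations, then join the 12 columns into one string per row
x = expand.grid(rep(list(c("A", "C", "G", "T")), 12))
kmers = sort(do.call(paste0, x), method = "radix")  # comm needs sorted input
writeLines(kmers, "12-mers_all_combinations.txt")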

I was thinking of "mapping" them to the reference genome with BLAT and picking the unmapped ones, but fragmenting the genome also sounds like a good idea.

Edit: I finished the analysis. I used jellyfish to extract all 12-mers present in the human genome (hg19), then used comm in bash to get the difference between the two sets:

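# Count 12-mers; jellyfish writes its database to the file given with -o
jellyfish count -m 12 -s 3G -t 10 -o 12-mer_counts.jf hg19.fa
# Dump as FASTA, drop the ">count" header lines, and sort for comm
jellyfish dump 12-mer_counts.jf | grep -v "^>" | sort > 12-mers_present_in_hg19.txt
# Keep lines found only in the first file, i.e. 12-mers absent from hg19
comm -23 12-mers_all_combinations.txt 12-mers_present_in_hg19.txt > 12-mers_absent_in_hg19.txt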

$ wc -l 12-mers_*
 16777216 12-mers_all_combinations.txt
 16609017 12-mers_present_in_hg19.txt
   168199 12-mers_absent_in_hg19.txt
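
One caveat that may be worth checking: without the -C (canonical) option, jellyfish counts k-mers as they appear in the input, so the list above reflects the forward strand only. A 12-mer whose reverse complement occurs in hg19 would still match the reverse strand. A minimal sketch of an extra filter, assuming one uppercase 12-mer per line (the function name is hypothetical), keeps only candidates whose reverse complement is also in the absent list:

```shell
# Reads one uppercase k-mer per line on stdin and prints only those whose
# reverse complement is also in the list.
both_strands_absent() {
  awk '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    function revcomp(s,    i, out) {
      out = ""
      for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
      return out
    }
    { listed[$0] = 1; kmer[NR] = $0 }
    END {
      for (n = 1; n <= NR; n++)
        if (revcomp(kmer[n]) in listed) print kmer[n]
    }
  '
}

# Toy input: AAA and TTT are reverse complements of each other, ACG has no partner.
printf 'AAA\nACG\nTTT\n' | both_strands_absent
```

On the real data this would be run as `both_strands_absent < 12-mers_absent_in_hg19.txt`, redirecting the output to a new file of your choice.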

Now let the biologists dig into the 168199 candidates to pick the best primers 😄

modified 7 months ago • written 7 months ago by bounlu