Question: How to find the shortest k-mer length that is unique in a large genome
0
gravatar for abalakrishnan95
3.6 years ago by
United States
abalakrishnan950 wrote:

Hey all,

 

I have been trying to figure out the equation to figure out what the shortest length a k-mer can be to be unique in a large genome. 

Anything would help,

 

Thank you

genome • 2.2k views
ADD COMMENTlink modified 3.6 years ago by John12k • written 3.6 years ago by abalakrishnan950
1

I am having trouble to understand you question? Are you looking for a single unique k-mer or a length at which all k-mers are unique? Could you elaborate?

ADD REPLYlink written 3.6 years ago by thackl2.6k
1

probability is ( 1 / 4^n ) (n = length of sequence). Find an n value which makes the denominator larger than your genome size.

ADD REPLYlink written 3.6 years ago by Adrian Pelin2.3k
1

There is no shortest length equation. This genome:

GCGCGC...GCTGC...GCGCGC

...has a shortest unique kmer of 1 (T).

On the other hand, if you remove the T, the genome has no unique kmers except for itself, and if it is circular, there are no unique kmers, period.  So, it's completely sequence-specific.  Though as Adrian points out, you can estimate the length at which a kmer might be unique.

 

For a specific genome, you can find it like this (using the BBMap package):

kmercountexact.sh in=genome.fa k=1 out=khist1.txt
kmercountexact.sh in=genome.fa k=2 out=khist2.txt
kmercountexact.sh in=genome.fa k=3 out=khist3.txt

...etc.  Eventually, the khist file will indicate that some kmers have a count of 1.  At that point, you know the answer.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Brian Bushnell16k
4
gravatar for John
3.6 years ago by
John12k
Germany
John12k wrote:

I once had to do this for a wet-lab challenge - "what is the two shortest primer pairs can can amplify something in the human genome".
You'd be surprised, even at 10bp there are quite a few unique 10mers which only crop up once in the genome:

x is kmer length, y is occurrences in human genome.
 

ADD COMMENTlink written 3.6 years ago by John12k

How did you compute this?

ADD REPLYlink written 3.6 years ago by reza.jabal320
3

To be honest, I no longer have the code, but the method was to take a given window size, say 10bp, then take the first 10bp of the genome fasta file and either add it to a hashtable, or, if it was already in the hashtable, mark the existing entry for that fragment in the table as repeated, or if its already marked as repeated, do nothing.

The result was apparently 1,000,000 fragments marked as repeated, and a couple of 1000 marked as unique. I repeated the process for a bunch of different window sizes, which is what gave this graph. There are 101 ways this could have been improved, but at the time I had just started learning python, and this was a nice experiment.

ADD REPLYlink written 3.6 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1684 users visited in the last hour