Question: Does it make any sense to make a kmer analysis in assembled sequences?
6 months ago, al.bodrug wrote:

Hello everyone,

Most of us working on large repetitive genomes are probably familiar with kmer distribution analysis on raw short reads, where we find a peak for the diploid portion of the genome, a hump in case of polyploidy and a smaller, flatter peak for the duplicated portion of the genome. This is usually done for kmer-based genome size estimation.
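For reference, the size estimate mentioned above usually comes down to dividing the total number of kmers by the depth of the main peak. Here is a minimal sketch of that arithmetic, assuming the histogram is already available as a dict of multiplicity to number of distinct kmers. The function name and the `min_depth` cutoff for discarding error kmers are illustrative assumptions; dedicated tools fit a proper model instead of taking the tallest bin.

```python
def estimate_genome_size(hist, min_depth=5):
    """Rough genome size estimate from a k-mer histogram.

    hist maps multiplicity -> number of distinct k-mers seen that many times.
    min_depth is an assumed cutoff below which k-mers are treated as
    sequencing errors and ignored.
    """
    # Keep only k-mers that are unlikely to be sequencing errors.
    solid = {m: c for m, c in hist.items() if m >= min_depth}
    # Take the most populated multiplicity as the main coverage peak.
    peak = max(solid, key=solid.get)
    # Total number of k-mer occurrences among the retained k-mers.
    total = sum(m * c for m, c in solid.items())
    return total / peak, peak


# Made-up histogram: an error tail at low depth, a main peak at 20x,
# and a few repeated k-mers at 40x.
toy_hist = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50, 40: 10}
size, peak = estimate_genome_size(toy_hist)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mer positions")
```

Note that in a heterozygous diploid the tallest peak may be the heterozygous one rather than the homozygous one, in which case this naive version would be off by roughly a factor of two.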

My question is, does it make sense to look at distribution of kmers in already assembled sequences? And if it does make sense, is it more logical to use large or short kmers?

I looked at the 7mer, 21mer, 55mer and 155mer distributions in an assembly of beet (Plant, eudicotyledon). It's 'just' a peakless descending curve, where sometimes a hump is distinguishable. On a biological level, is this informative in any way?

Cheers, Alex

sequence assembly genome • 242 views

Not sure about your exact question, but I've used KAT to compare the k-mers in the short reads used to build an assembly against the k-mers in the assembly itself, which gives an idea of the assembly quality, see

(comment written 6 months ago by jean.elbers)

6 months ago, Corentin wrote:

This does not really make sense; kmer analysis is more useful when applied to reads:

Assemblies often represent only one haplotype, so you will not be able to infer the ploidy from the assembly.

Do not forget that the x-axis of the kmer plot represents the frequency of the kmer (how many times it appears in your sequences), which is why it is often used to assess read coverage. In an assembly, however, you have a "coverage of 1" (apart from the repeated sequences): most kmers occur exactly once, which explains the peakless curve.
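To make the "coverage of 1" point concrete, here is a small sketch that builds the kmer spectrum of a toy assembly-like sequence. The sequences and the `kmer_spectrum` helper are made up for illustration, and for simplicity the kmers are not collapsed with their reverse complements as most real counters do.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Return a Counter mapping multiplicity -> number of distinct k-mers."""
    # Count every overlapping k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Histogram of those counts: how many distinct k-mers occur 1x, 2x, ...
    return Counter(counts.values())


# Toy "assembly": mostly unique sequence plus one repeat present three times.
unique = "ACGTTGCAAGCTTAGGCTAACCGTAGGATCCA"
repeat = "GATTACAGATTACA"
assembly = unique + repeat + "TTGACC" + repeat + "CCAGTT" + repeat

spectrum = kmer_spectrum(assembly, 5)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In an assembly, any kmers with multiplicity above 1 come from repeats rather than from sequencing depth, so there is no coverage peak to find.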

However, as jean.elbers mentioned in the comments, if you have access to the raw reads you can perform the kmer analysis on them, and with KAT (K-mer Analysis Toolkit) you can compare the kmer content of your reads against that of the assembly to assess completeness and duplication levels.
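The idea behind this kind of comparison can be sketched in a few lines. This is only a conceptual illustration, not KAT's implementation or interface; the function names and the toy sequences are assumptions. Each kmer found in the reads is binned by how many times it occurs in the assembly, and the read-depth histogram is tallied per bin.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """Count all overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def spectra_by_copy_number(reads, assembly_seqs, k):
    """Map assembly copy number -> Counter(read multiplicity -> n distinct k-mers)."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly_seqs, k)
    table = defaultdict(Counter)
    for kmer, depth in read_counts.items():
        # Copy number 0 means the k-mer is in the reads but not in the assembly.
        table[asm_counts.get(kmer, 0)][depth] += 1
    return table


# Toy data: the assembly is missing the last base that the reads support.
toy_reads = ["ACGT", "ACGT"]
toy_assembly = ["ACG"]
table = spectra_by_copy_number(toy_reads, toy_assembly, 3)
for copy_number in sorted(table):
    print(copy_number, dict(table[copy_number]))
```

Kmers that sit at the main read-depth peak but have copy number 0 suggest missing content, while kmers at single-copy read depth with copy number 2 or more suggest duplicated regions in the assembly.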

Whether to use large or short kmers depends on the genome, but 7 seems very short. The assumption is that a kmer should represent a unique sequence in the genome; if you choose a short kmer, the same kmer sequence will occur at many unrelated positions.
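A quick back-of-the-envelope calculation shows why. With four bases there are only 4^k possible kmers, so under a uniform random sequence model (a crude assumption, real genomes are far from random) the expected number of copies of a given kmer is roughly the genome length divided by 4^k. The 500 Mb genome size below is a hypothetical round number, not a measured value.

```python
def possible_kmers(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def expected_copies(genome_size, k):
    """Expected occurrences of one given k-mer in a uniform random sequence."""
    return (genome_size - k + 1) / possible_kmers(k)


genome_size = 500_000_000  # hypothetical genome length in bp
for k in (7, 21, 55):
    print(k, possible_kmers(k), expected_copies(genome_size, k))
```

For k = 7 there are only 16,384 possible kmers, so every one of them is expected to occur tens of thousands of times by chance alone; by k = 21 the expected count for a random kmer is far below 1, so repeated kmers reflect actual repeats.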

Here is a tutorial for genome size estimation from a kmer analysis (but there are plenty of other resources online):

Not directly related to your question but still may be of interest to you, the effect of kmer size in assembly:
