Question: Does it make sense to run a k-mer analysis on assembled sequences?
al.bodrug (France) wrote, 6 weeks ago:

Hello everyone,

Most of us working on large repetitive genomes are probably familiar with k-mer distribution analysis on raw short reads, where we find a peak for the diploid portion of the genome, a hump in the case of polyploidy, and a smaller, flatter peak for the duplicated portion of the genome. This is usually done for k-mer-based genome size estimation.

My question is: does it make sense to look at the distribution of k-mers in already assembled sequences? And if it does, is it more logical to use large or short k-mers?

I looked at the 7-mer, 21-mer, 55-mer and 155-mer distributions in an assembly of beet (plant, eudicotyledon). It's 'just' a peakless descending curve, where a hump is sometimes distinguishable. On a biological level, is this informative in any way?

Cheers, Alex

sequence assembly genome • 137 views

Not sure about your exact question, but I've used KAT to compare the k-mers in the short reads used to build an assembly against the k-mers in the assembly itself, to get an idea of the assembly quality. See https://kat.readthedocs.io/en/latest/walkthrough.html#genome-assembly-analysis-using-k-mer-spectra

Comment written 6 weeks ago by jean.elbers
Corentin wrote, 6 weeks ago:

This does not really make sense; k-mer analysis is more useful when applied to reads:

Assemblies often represent only one haplotype, so you will not be able to infer the ploidy from the assembly.

Do not forget that the x-axis of the k-mer plot represents the frequency of each k-mer (how many times it appears in your sequences); with reads, this is often used to assess the read coverage. In an assembly, however, you have a "coverage of 1" (apart from the repeated sequences), which explains the peakless curve.
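To illustrate the "coverage of 1" point, here is a minimal Python sketch that builds the k-mer spectrum of a toy "assembly" and of reads simulated from it. Everything here is an assumption made for illustration: the random 5 kb sequence, the error-free 100 bp reads at roughly 20x, forward-strand counting only, and the helper names `count_kmers` and `kmer_spectrum`. Real tools also count canonical k-mers and scale to full genomes.

```python
from collections import Counter
import random

def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmer_spectrum(sequences, k):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(count_kmers(sequences, k).values())

random.seed(0)
# Toy, repeat-free "assembly": a random 5 kb sequence.
genome = "".join(random.choice("ACGT") for _ in range(5000))

# Simulate error-free 100 bp reads at roughly 20x coverage.
read_len, coverage = 100, 20
n_reads = coverage * len(genome) // read_len
reads = []
for _ in range(n_reads):
    start = random.randrange(len(genome) - read_len + 1)
    reads.append(genome[start:start + read_len])

k = 21
assembly_spectrum = kmer_spectrum([genome], k)
read_spectrum = kmer_spectrum(reads, k)

# Assembly: the distinct k-mers pile up at multiplicity 1, so there is no peak.
print(sorted(assembly_spectrum.items())[:3])
# Reads: the most common multiplicity sits well above 1, near the k-mer coverage.
print(read_spectrum.most_common(1))
```

In a real assembly the only k-mers above multiplicity 1 come from repeats, which is why the curve just descends from 1.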

However, as jean.elbers mentioned in the comments, if you have access to the raw reads you can perform the k-mer analysis on them, and with KAT (K-mer Analysis Toolkit) you can compare the k-mer content of your reads against the assembly to assess completeness and duplication levels.
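The idea behind such a reads-versus-assembly comparison can be sketched in a few lines of Python. This is not KAT's implementation or API, only a simplified illustration: the function names and the toy sequences are hypothetical, and only the forward strand is counted.

```python
from collections import Counter, defaultdict

def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def compare_spectra(reads, assembly, k):
    """Return {assembly_copy_number: Counter(read_multiplicity -> n_distinct_kmers)}."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    table = defaultdict(Counter)
    # Visit every k-mer seen in either data set; a missing k-mer counts as 0.
    for kmer in read_counts.keys() | asm_counts.keys():
        table[asm_counts[kmer]][read_counts[kmer]] += 1
    return table

# Toy data: the third read carries a k-mer (CGG) that the assembly lacks.
reads = ["ACGTAC", "ACGTAC", "GTACGG"]
assembly = ["ACGTAC"]
table = compare_spectra(reads, assembly, k=3)

for copy_number in sorted(table):
    print(copy_number, dict(table[copy_number]))
```

Read k-mers near the main coverage peak that have copy number 0 in the assembly point to missing sequence, while a copy number of 2 or more at that same multiplicity points to duplicated content. Low-multiplicity k-mers absent from the assembly are mostly sequencing errors.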

Whether to use large or short k-mers depends on the genome. 7 seems very short, though: there are only 4^7 = 16,384 possible 7-mers. The assumption is that each k-mer should represent a unique position in the genome, and with a short k the same k-mer will occur at many unrelated positions.
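A quick back-of-the-envelope check of why k = 7 is too short, written as a small Python sketch. The genome length is a hypothetical round number chosen for illustration, and the calculation assumes a uniformly random sequence. Real genomes are repeat-rich, so actual k-mer multiplicities are higher than this model suggests.

```python
def expected_occurrences(genome_length, k):
    """Expected copies of any given k-mer in a uniformly random sequence."""
    return (genome_length - k + 1) / 4 ** k

GENOME_LENGTH = 500_000_000  # hypothetical genome size, for illustration only

for k in (7, 21, 55, 155):
    # Prints k, the number of possible k-mers, and the expected copies per k-mer.
    print(k, 4 ** k, expected_occurrences(GENOME_LENGTH, k))
```

Under these assumptions every 7-mer is expected tens of thousands of times, so its count says nothing about a specific locus, whereas from k = 21 upward a k-mer is expected far less than once by chance and repeated k-mers reflect genuine repeats.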

Here is a tutorial for genome size estimation from a kmer analysis (but there are plenty of other resources online): https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/
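The core arithmetic of a read-based estimate is simple: divide the total number of k-mers by the multiplicity of the main peak. The sketch below assumes a histogram already produced by a k-mer counter; the toy histogram, the function name and the `min_multiplicity` cutoff used to skip the low-multiplicity error k-mers are all illustrative assumptions, not values from the tutorial. Dedicated tools fit a statistical model rather than picking a single peak.

```python
def estimate_genome_size(histogram, min_multiplicity=3):
    """Estimate genome size from a k-mer histogram.

    histogram: dict mapping multiplicity -> number of distinct k-mers.
    min_multiplicity: k-mers below this are treated as sequencing errors
    (an assumed cutoff; in practice it is chosen by inspecting the plot).
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    peak = max(solid, key=solid.get)                     # main coverage peak
    total_kmers = sum(m * n for m, n in solid.items())   # k-mer occurrences
    return total_kmers / peak

# Toy histogram: an error tail at 1-2, a main peak at 15, a repeat bump at 30.
toy_histogram = {1: 50000, 2: 3000, 14: 800, 15: 1000, 16: 900, 30: 50}
estimate = estimate_genome_size(toy_histogram)
print(round(estimate))
```

For the toy numbers this gives about 2,807, close to the 2,800 positions implied by 2,700 single-copy k-mers plus 50 two-copy k-mers.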

Not directly related to your question, but possibly still of interest: the effect of k-mer size in assembly: https://github.com/rrwick/Bandage/wiki/Effect-of-kmer-size
