Confusion about the kmer coverage
5.6 years ago
Picasa ▴ 610

Hi,

I ran SPAdes with k = 77, and my reads are 100 bp long.

This is one of my contigs:

>NODE_86_length_345_cov_0.615672
TCCTTTGACTCCTTTGACACTGACAAATTGGCTTCCATATTTTATACCTTAATCATCTAATTGGCTTCCATATTTTATACCTTAATCATCT...

So this contig is 345 bp long, and the last value in the header is, from what I understand, the k-mer coverage, which is about 0.62.

This is the definition I found:

the k-mer coverage of a contig is the number of k-mer that map (with perfect identity) to that contig.

That definition confuses me. How can the k-mer coverage be 0.62 when the contig is 345 bp long and must therefore have been built from more than one k-mer?

kmer spades • 8.6k views
5.6 years ago
mastal511 ★ 2.1k

See the equation for calculation of kmer coverage in the velvet manual, section 5.1,

http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf

It involves the nucleotide coverage, the read length and the k-mer length. With a read length of 100 and a k-mer length of 77, the k-mer coverage is (100 - 77 + 1)/100 = 0.24 times the nucleotide coverage, so a k-mer coverage of 0.62 corresponds to a nucleotide coverage of about 2.6.
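As a quick check of that arithmetic, here is a minimal Python sketch of the conversion in both directions. The function names are made up for illustration and are not part of Velvet or SPAdes. It also assumes that the `cov` value in the SPAdes header can be plugged into the Velvet formula unchanged.

```python
def nucleotide_to_kmer_coverage(nuc_cov, read_len, k):
    """Ck = C * (L - k + 1) / L (formula from the Velvet manual)."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return nuc_cov * (read_len - k + 1) / read_len


def kmer_to_nucleotide_coverage(kmer_cov, read_len, k):
    """Invert the formula above: C = Ck * L / (L - k + 1)."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return kmer_cov * read_len / (read_len - k + 1)


# Values from this thread: cov_0.615672, 100 bp reads, k = 77.
nuc_cov = kmer_to_nucleotide_coverage(0.615672, read_len=100, k=77)
print(round(nuc_cov, 2))  # 2.57
```

With k close to the read length, each read contributes only a few k-mers (24 here), which is why the k-mer coverage ends up so much lower than the nucleotide coverage.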

0
Entering edit mode

Thanks for the link.

So k-mer coverage is not really interpretable? Is it better to convert it to base coverage?

And is the definition I found through Google wrong, then?

0
Entering edit mode

Of course k-mer coverage is interpretable.

The assembler will be happy if you provide it with an approximate value of the genome size, because it will use that value to check the assembly.

You can supply that genome size value as either the nucleotide coverage or the k-mer coverage. Since this assembler uses de Bruijn graphs with 77-mer nodes, you can express the nucleotide coverage as k-mer coverage.

The actual formula is Ck = C * (L - k + 1) / L, where Ck is the k-mer coverage, C is the nucleotide coverage, L is the length of your reads, and k is the hash length you select for the de Bruijn graph, in this case 77.

0
Entering edit mode

Thanks for your explanation, that's very useful.

However, I'm sorry, but I still don't have my answer. As I said in my post, the definition I found for k-mer coverage is

the k-mer coverage of a contig is the number of k-mer that map (with perfect identity) to that contig.

This is not consistent with what I see: I have a k-mer coverage of 0.62, which should not be possible for a contig of 345 bp built with k = 77.

0
Entering edit mode

It seems that there are two k-mer coverage definitions involved in this thread:

1. A k-mer coverage for the whole genome, which depends on the total number of bases you sequenced.

2. A contig coverage. This value may be the share of the total k-mer coverage that this particular contig accounts for, relative to all the contigs the assembler produced. To check this, take the header line of each record in the FASTA file, extract the coverage values, sum them, and see whether the total is close to 100.
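The check described in point 2 could be sketched in Python as follows. This is only an illustration: it assumes headers shaped like the one in the original post (`>NODE_<id>_length_<len>_cov_<value>`), the helper name `contig_coverages` is made up, and the second header in the demo is invented to have something to sum.

```python
import re

# Assumed header layout, based on the example in the original post.
HEADER_RE = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def contig_coverages(lines):
    """Return (contig_length, coverage) for every matching header line."""
    records = []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            records.append((int(match.group(2)), float(match.group(3))))
    return records


demo = [
    ">NODE_86_length_345_cov_0.615672",  # header from the original post
    "TCCTTTGACTCC",
    ">NODE_1_length_5000_cov_200.0",     # invented header, illustration only
    "ACGT",
]

records = contig_coverages(demo)
total = sum(cov for _, cov in records)
print(len(records), total)
```

For a real assembly, pass an open file handle instead of `demo`, for example `contig_coverages(open("contigs.fasta"))`, where the file name is whatever your assembler wrote. If the total is far from 100, the values are not percentages of a whole.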

0
Entering edit mode

The contig coverages (or k-mer coverages) do not sum to 100. For instance, some contigs alone have a coverage of 200.