Question

How to determine coverage cutoffs in k-mer distribution

0

3.4 years ago

pablo ▴ 300

Hello,

Could someone clearly explain why we set coverage cutoffs on the k-mer distribution in order to improve assemblies?

And how are these cutoffs determined?

Best

kmer distribution kmer Assembly • 1.2k views

ADD COMMENT • link updated 3.4 years ago by Mensur Dlakic ★ 27k • written 3.4 years ago by pablo ▴ 300

score 4 · Accepted Answer · 2020-12-08

4

3.4 years ago

Mensur Dlakic ★ 27k

K-mers may be in low abundance because they occur rarely in the genome and, on top of that, happened not to be sequenced many times. A more likely explanation is that rare k-mers come from sequencing errors: a single miscalled base changes every k-mer that overlaps it (up to k of them), and the resulting k-mers are unlikely to appear anywhere else in the data. You can probably find a statistical proof for that by Googling, but it should be pretty intuitive that k-mers that occur only once or twice are more likely to come from sequencing errors than to be real.
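To make that intuition concrete, here is a small sketch in Python. Everything in it is made up for illustration: the 20-base toy "genome", the choice of k = 5, and the simplification that every read spans the whole genome. It counts k-mers across ten error-free reads plus one read carrying a single substitution, and shows that the k-mers created by the error sit at multiplicity 1 while the genuine ones sit near the coverage.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (as a plain string) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

K = 5                                # toy value, real assemblies use larger k
genome = "ACGTTGCATGCCGATAGCTA"      # hypothetical 20-base genome

# Ten perfect reads covering the whole genome (10x coverage) ...
reads = [genome] * 10
# ... plus one read with a single substitution at position 10 (C -> A).
bad_read = genome[:10] + "A" + genome[11:]
reads.append(bad_read)

counts = count_kmers(reads, K)
true_kmers = set(count_kmers([genome], K))

# K-mers that never occur in the real genome were created by the error.
error_kmers = {kmer: n for kmer, n in counts.items() if kmer not in true_kmers}
true_counts = [counts[kmer] for kmer in true_kmers]

print("error k-mers:", error_kmers)
print("lowest count among genuine k-mers:", min(true_counts))
```

One wrong base produced K new k-mers, each seen exactly once, while every genuine k-mer is still seen at least ten times. That gap is what a low-coverage cutoff exploits.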

Cutoffs are chosen so that we exclude as many k-mers resulting from sequencing errors as possible. At the same time, we don't want to throw away the reads with truly rare k-mers. The exact number is determined from the k-mer distribution and the overall sequencing coverage.
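One common heuristic for reading a lower cutoff off the k-mer distribution (not necessarily what any particular tool does) is to look for the valley between the steep low-count error peak and the main coverage peak. The sketch below assumes a histogram where `hist[m]` is the number of distinct k-mers seen exactly `m` times; the function name `valley_and_peak` and the toy numbers are invented for illustration.

```python
def valley_and_peak(hist):
    """Return (valley, peak) multiplicities from a k-mer histogram.

    hist[m] is the number of distinct k-mers with multiplicity m;
    hist[0] is ignored. Assumes a smooth histogram with a falling error
    tail followed by one main peak. Real data is noisier and may need
    smoothing before this works.
    """
    # Walk down the error tail until the counts stop decreasing.
    valley = 1
    while valley + 1 < len(hist) and hist[valley + 1] < hist[valley]:
        valley += 1
    # The main coverage peak is the highest point at or after the valley.
    peak = max(range(valley, len(hist)), key=lambda m: hist[m])
    return valley, peak

# Made-up histogram: error k-mers pile up at low multiplicity,
# genuine k-mers form a peak around the sequencing coverage.
hist = [0, 900, 300, 80, 20, 12, 30, 90, 210, 380, 520,
        600, 540, 400, 250, 120, 50, 15, 5]

valley, peak = valley_and_peak(hist)
print("lower cutoff candidate (valley):", valley)
print("main coverage peak:", peak)
```

K-mers below the valley would be treated as likely errors. An upper cutoff for repeats is often set at some multiple of the peak, but the exact multiple is a judgment call that depends on the genome and the tool.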

This paper may help:

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5272-y

ADD COMMENT • link 3.4 years ago by Mensur Dlakic ★ 27k

0

Thanks a lot, that helps! I asked because I used the tool purge_dups to improve my assembly. One step of the purge_dups pipeline estimates these cutoffs and stores them in a file like this: 5 13 21 25 42 75. As I understand it, the first one (5) is to remove the k-mers associated with sequencing errors, and the last one (75) is to remove the k-mers associated with high coverage, i.e. repeats. Do you know why we need the four others? Wouldn't the first and last ones be enough?

ADD REPLY • link 3.4 years ago by pablo ▴ 300

1

I don't know the exact answer to your question because I have never used that tool. My guess is that this is akin to the significance thresholds used to reject a null hypothesis. While 0.05 is good enough by most standards, the confidence is greater if the p-value falls below 0.01. By that logic, the first cutoff at 5 would remove the majority of sequencing errors. If you wanted an assembly that is even more accurate at the expense of being less complete, you'd go for the next higher cutoff.

ADD REPLY • link 3.4 years ago by Mensur Dlakic ★ 27k