How to determine coverage cutoffs in k-mer distribution
3.4 years ago
pablo ▴ 300

Hello,

Could someone explain clearly why we set coverage cutoffs on the k-mer distribution in order to improve assemblies?

And how are these cutoffs determined?

Best

kmer distribution kmer Assembly • 1.2k views
3.4 years ago
Mensur Dlakic ★ 27k

K-mers may be in low abundance because they occur rarely in the genome and, in addition, happened not to be sequenced many times. A more likely explanation is that rare k-mers come from sequencing errors. You can probably find a statistical proof for that by Googling, but it should be pretty intuitive that k-mers that occur only once or twice are more likely to come from sequencing errors than to be real.

Cutoffs are chosen so that we exclude as many k-mers resulting from sequencing errors as possible. At the same time, we don't want to throw away the reads with truly rare k-mers. The exact number is determined from the k-mer distribution and the overall sequencing coverage.

This paper may help:

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5272-y
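
To make the idea of reading a cutoff off the k-mer distribution concrete, here is a minimal Python sketch. It assumes a simple heuristic that is not taken from the paper above or from any particular tool: the low cutoff is placed at the first valley of the k-mer histogram, between the steep error peak at multiplicity 1 and the main coverage peak. The function name find_valley_and_peak and the toy counts are made up for illustration; a real histogram is noisier and would likely need smoothing first.

```python
def find_valley_and_peak(counts):
    """Return (valley, peak) multiplicities from a k-mer histogram.

    counts[i] is the number of distinct k-mers seen exactly i + 1 times.
    Assumed heuristic: the error peak decays from multiplicity 1 down to a
    first local minimum (the valley); the main peak is the highest bin at or
    beyond that valley.
    """
    valley = 0
    # Walk down the error peak while the counts keep strictly decreasing.
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1
    # The main coverage peak is the largest bin from the valley onwards.
    peak = max(range(valley, len(counts)), key=lambda i: counts[i])
    # Convert 0-based indices back to multiplicities.
    return valley + 1, peak + 1


# Made-up histogram for multiplicities 1..12 (illustrative values only).
toy_counts = [9000, 2500, 600, 150, 90, 140, 400, 900, 1500, 1800, 1400, 700]

low_cutoff, main_peak = find_valley_and_peak(toy_counts)
print(low_cutoff, main_peak)
```

With these toy numbers the valley falls at multiplicity 5 and the main peak at 10, so k-mers seen fewer than about 5 times would be treated as likely errors. Where exactly to draw the line still depends on the overall sequencing coverage, as noted above.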


Thanks a lot, that helps! I asked because I used the tool purge_dups to improve my assembly. One step of the purge_dups pipeline is estimating these cutoffs, which are stored in a file like this: 5 13 21 25 42 75. I understand that the first one (5) is there to remove the k-mers associated with sequencing errors, and the last one to remove the k-mers associated with high coverage (repeats). Do you know why we need the four others? Wouldn't the first and last ones be enough?


I don't know the exact answer to your question because I never used that tool. My guess is that this is akin to significance thresholds that are used to reject the null hypothesis. While 0.05 is good enough by most standards, the confidence will be greater if it goes below 0.01. If we apply that logic, the first cutoff at 5 would remove the majority of sequencing errors. If you wanted an assembly that is even more accurate at the expense of being less complete, you'd go for the next higher cutoff.
