Estimate K-mer size for de novo assembly
2.1 years ago

I want to estimate the k-mer size before performing de novo assembly of paired-end Illumina reads (using SOAPdenovo2). My read length is 151 bp.

What is the best k-mer estimation software? I tried KmerGenie installed through conda, but it exited with an error: ModuleNotFoundError: No module named 'readfq'

Hence, I'm looking either for an alternative or for a way to fix the error.

SOAPdenovo2 accepts odd numbers between 13 and 31. However, according to other discussions (How To Choose The K Value Of Kmer In Soapdenovo?), the k-mer size should be 1/2 to 2/3 of the read length, which in my case would be ~75-100, exceeding the SOAPdenovo2 limit.
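To make that rule of thumb concrete, here is a minimal sketch that lists odd k values between 1/2 and 2/3 of a given read length. The function name `candidate_k` and the default step of 4 are made up for illustration, and the rule itself is only the heuristic from the linked thread, not a guarantee of a good assembly.

```shell
#!/usr/bin/env bash
# candidate_k READ_LEN [STEP]
# Print odd k values between 1/2 and 2/3 of the read length.
# Hypothetical helper for illustration; STEP defaults to 4.
candidate_k() {
    local read_len=$1 step=${2:-4}
    local lo=$(( (read_len + 1) / 2 ))   # 1/2 of read length, rounded up
    local hi=$(( read_len * 2 / 3 ))     # 2/3 of read length, rounded down
    local k ks=()

    # k must be odd, so start from the first odd value >= lo
    if (( lo % 2 == 0 )); then lo=$(( lo + 1 )); fi
    # an even step keeps every following value odd
    if (( step % 2 == 1 )); then step=$(( step + 1 )); fi

    for (( k = lo; k <= hi; k += step )); do
        ks+=("$k")
    done
    echo "${ks[*]}"
}

candidate_k 151
```

For 151 bp reads this prints `77 81 85 89 93 97`, all of which are above 31.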

What are your suggestions?

de novo assembly soapdenovo estimation kmergenie K-mer • 2.4k views

The thread you linked is for a different version of soapdenovo.

Please follow what the program version you plan to use accepts. If SOAPdenovo2 wants a number between 13 and 31, you will not be able to use a number outside those bounds. It is often more productive to try several runs and see what works best than to search for an optimal setting up front. Every dataset is different, and general recommendations may not always produce the best result.
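Trying several runs can be scripted. The sketch below only prints one SOAPdenovo2 command per k value (a dry run), so the commands can be inspected before anything is launched. The binary name `SOAPdenovo-63mer`, the config file name `soap.config`, the output prefix `asm_k<k>` and the thread count are assumptions for illustration; adjust them to your installation and data.

```shell
#!/usr/bin/env bash
# soap_sweep CONFIG K [K ...]
# Print one SOAPdenovo2 command per k-mer size; nothing is executed.
# Assumed for illustration: binary SOAPdenovo-63mer, 20 threads,
# output prefix asm_k<k>.
soap_sweep() {
    local config=$1 k
    shift
    for k in "$@"; do
        echo "SOAPdenovo-63mer all -s $config -K $k -o asm_k$k -p 20"
    done
}

# Example: three k values inside the 13-31 range mentioned above
soap_sweep soap.config 23 27 31
```

Once the printed commands look right, they can be piped to `bash` and the resulting assemblies compared, for example by contig count and N50.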

2.1 years ago
Mensur Dlakic ★ 27k

I have already answered this question indirectly in one of your previous queries, although it was a different context.

Most modern assemblers know how to pick the best k-mer size as long as they are given enough options to work with. SPAdes has a -k option which by default is set to auto, and the program will sample various k-mer sizes before picking the best. Since you seem to be a fan of error correction: for already corrected data you can specify the --only-assembler option, since there is nothing left to correct. Personally, I would give the program uncorrected data and let it do its own error correction. A last piece of advice regarding this assembler: you may be tempted to use the --careful option, but for most datasets it will be unnecessary. In my hands that option yields better results only for single genomes sequenced at extremely high depth.
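As a rough illustration of those SPAdes choices, the sketch below assembles the command line for a sample without running it. The helper name `spades_cmd`, the file naming scheme, the output directory and the thread count are assumptions; only `-k`, `--only-assembler`, `-1`, `-2`, `-o` and `-t` are actual SPAdes options.

```shell
#!/usr/bin/env bash
# spades_cmd SAMPLE [KLIST] [CORRECTED]
# Print a spades.py command; nothing is executed.
#   KLIST     comma-separated odd k values; empty = leave -k at its default (auto)
#   CORRECTED "yes" if reads were error-corrected beforehand
# Hypothetical helper; file and directory names are placeholders.
spades_cmd() {
    local sample=$1 klist=${2:-} corrected=${3:-no}
    local args=(spades.py)

    if [ -n "$klist" ]; then
        args+=(-k "$klist")
    fi
    if [ "$corrected" = "yes" ]; then
        args+=(--only-assembler)   # skip SPAdes' own read error correction
    fi
    args+=(-1 "${sample}_1.fq.gz" -2 "${sample}_2.fq.gz" -o "spades_${sample}" -t 20)
    echo "${args[*]}"
}

spades_cmd sampleA                     # default: automatic k selection
spades_cmd sampleA 21,33,55,77 yes     # explicit k list, pre-corrected reads
```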

Same advice for MEGAHIT: it has several options to specify k-mers as a list, as a min-max range with fixed steps, or as a preset group of numbers. If no option is chosen, it will sample [21,29,39,59,79,99,119,141] which is a sensible option because it covers a huge range of k-mers. While it is worth checking other options for more customization, one can't go wrong by going with default values and letting the assemblers figure it out. It is worth feeding error-corrected reads to this assembler, and it will generally do better with corrected than with raw reads.
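The three ways of specifying k-mers in MEGAHIT could look roughly like this. It is a dry-run sketch: the helper name `megahit_cmd`, the read file names, the output directories and the particular k values are placeholders chosen for illustration, while `--k-list`, `--k-min`, `--k-max`, `--k-step` and `--presets` are the MEGAHIT options being described.

```shell
#!/usr/bin/env bash
# megahit_cmd MODE
# Print a MEGAHIT command for one k-mer selection style; nothing is executed.
# MODE is one of: list, range, preset, default.
# File names, output directories and k values are placeholders.
megahit_cmd() {
    local mode=$1 k_opts
    case $mode in
        list)    k_opts="--k-list 21,41,61,81,99" ;;                 # explicit list of odd k values
        range)   k_opts="--k-min 21 --k-max 101 --k-step 10" ;;      # min-max range with an even step
        preset)  k_opts="--presets meta-sensitive" ;;                # predefined group of settings
        default) k_opts="" ;;                                        # let MEGAHIT use its built-in list
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "megahit -1 sample_1.fq.gz -2 sample_2.fq.gz ${k_opts:+$k_opts }-o megahit_$mode -t 20"
}

for mode in list range preset default; do
    megahit_cmd "$mode"
done
```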


Thank you for the information! I'm currently attempting to use SPAdes

with prior correction:

# Reads already error-corrected, so skip SPAdes' own correction step
for f in $(ls -1 *_1.fq.gz | sed 's/_1\.fq\.gz$//'); do
    spades.py --only-assembler -o ../denovo_assembly/corrected -1 "${f}_1.fq.gz" -2 "${f}_2.fq.gz" -t 20
done

and without prior correction

# Raw reads; SPAdes runs its own error correction before assembly
for f in $(ls -1 *_1.fq.gz | sed 's/_1\.fq\.gz$//'); do
    spades.py -o ../denovo_assembly/not_corrected -1 "${f}_1.fq.gz" -2 "${f}_2.fq.gz" -t 20
done

It returned a warning:

Too many erroneous kmers, the estimates might be unreliable

I think your original question has been answered, and I don't really see a question in this latest post. If you are wondering about the warning, it is telling you something about your data. It may be a good idea to assemble both ways and compare the assemblies, but that goes beyond k-mer size selection.
