I am doing de novo assembly of some metagenomic datasets. They are Illumina NextSeq reads (paired-end, 150 bp per read). I have tried IDBA-UD and SPAdes so far. Both gave me final N50 values of a few thousand bp, which is not too bad but still below my expectations.
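For reference, N50 here is meant in the usual sense: the contig length L such that contigs of length L or longer contain at least half of the total assembled bases. A minimal sketch of that calculation, assuming the contig lengths have already been extracted from the assembly FASTA (the function name `n50` is only for illustration and is not part of IDBA-UD or SPAdes):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty).

    Illustrative helper, not taken from any assembler.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the running sum reaches half the total.
        if 2 * running >= total:
            return length
    return 0


# Toy lengths, not real data: the total is 32, and the two 8 bp contigs
# already cover half of it.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

Since N50 is weighted by length, a few long contigs can raise it a lot even when most contigs stay short, so it is worth reading together with the total length and the number of contigs.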
I noticed that I can manually set the k-mer sizes used in each iteration. In SPAdes, the recommended k-mer sizes are 21, 33, 55, 77, and in IDBA-UD the default is 20 to 100 with an increment of 20. I changed IDBA-UD's maximum k-mer size from 100 to 240, and the final N50 value is significantly higher. Below is a plot of the metrics per iteration (x-axis):
My questions are:
1) I feel that a larger maximum k-mer size does perform better than a smaller one, since the N50 value grows almost linearly without notably compromising total length. Am I right?
2) Based on the figure, are there any improvements I could make (e.g., further increasing the maximum k-mer size, decreasing the increment, etc.)?
3) What other tips would you suggest I try?
Thanks and you all have a great day.
== update ==================
Here is another plot of the distribution of resulting contig sizes at different maximum k-mer sizes by IDBA-UD. It looks to me as though the performance is indeed 240 > 180 > 120, because the whole curve shifts to the right without changing shape much. Am I right?
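One way to check the "whole curve moves right" impression numerically, rather than by eye, would be to compare several Nx values (N10, N25, N50, N75, N90) for each assembly instead of N50 alone. The sketch below assumes the contig lengths of each run are available as plain lists; the names `nx` and `nx_profile` are hypothetical helpers, and the lengths in the usage lines are toy numbers, not my results.

```python
def nx(contig_lengths, percent):
    """Return Nx: the length L such that contigs of length >= L
    hold at least `percent` % of the total bases (0 if there are no contigs)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return 0


def nx_profile(contig_lengths, percents=(10, 25, 50, 75, 90)):
    """Return a dict mapping each percent to its Nx value (illustrative helper)."""
    return {p: nx(contig_lengths, p) for p in percents}


# Toy lengths only, standing in for two assemblies to be compared.
assembly_a = [2, 2, 2, 3, 3, 4, 8, 8]
assembly_b = [4, 4, 4, 6, 6, 8, 16, 16]
print(nx_profile(assembly_a))
print(nx_profile(assembly_b))
```

If every Nx value increases from one run to the next, that would support the idea that the entire distribution has shifted rather than just its upper tail.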