I'm currently involved in the assembly (Illumina data) of a few species with genome sizes of 10-26 Gb. I'm using the ABySS assembler, mainly because of its excellent ability to scale on large compute clusters, and of course because it has given good results in the past. To choose the k-mer size for the assembly, I run the pipeline up to the unitig stage with different values of k and then evaluate which k will work best, since running the whole pipeline on all the data for every k is rather unfeasible. I've now started wondering whether this is a valid approach. More specifically: is performance at the unitig level a good proxy for the performance/result of the whole process (i.e. up to the contig or even scaffold level)?
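For reference, my k sweep looks roughly like the sketch below (directory and file names are illustrative; it follows the pattern from the ABySS documentation, using the `unitigs` target to stop `abyss-pe` before the contig and scaffold stages, and `abyss-fac` to compare contiguity stats across k):

```shell
#!/bin/sh
# Sketch of a k-mer sweep to the unitig stage only.
# Read file names and k range are illustrative, not my real data.
for k in 32 48 64 80 96; do
    mkdir -p "k$k"
    # Guard so the sketch is harmless where ABySS isn't installed.
    if command -v abyss-pe >/dev/null 2>&1; then
        # -C runs in the per-k directory; 'unitigs' stops the pipeline
        # before the contig/scaffold stages.
        abyss-pe -C "k$k" k="$k" name=asm \
            in='../reads_1.fq.gz ../reads_2.fq.gz' unitigs
    fi
done
# Tabulate N50 and related metrics across all k values.
if command -v abyss-fac >/dev/null 2>&1; then
    abyss-fac k*/asm-unitigs.fa
fi
```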
Would I be better off running the whole pipeline but on, for example, only one pair of input files? (I think not, because then coverage, or rather the lack of it, would become an issue.)
Does anybody have an idea or experience with this (or perhaps a comparison of unitig vs. contig (or scaffold) performance)?