Hi folks. I need to run a de novo short-read genome assembler (on a paired-end/mate-pair library) that favors shorter but error-free contigs over longer contigs/scaffolds that may be misassembled. Which assembler, or which specific settings in an assembler of your choice, would you recommend to get such contigs (as error-free as possible, with no overlapping contigs)?
I think error-free contigs depend on the quality of your data too, and on contamination, if any. They also depend on the repetitiveness of the genome, the level of polymorphism (in order to judge the correctness of the contigs) and the heterozygosity of the individual. SOAP contigs are short, as they start from K+1 of your k-mer. By increasing the min_abundance parameter in de novo assemblers, you can get more accurate contigs: k-mers seen fewer times than this threshold are treated as likely sequencing errors and left out of the graph, at the cost of breaking contigs in low-coverage regions. Minia is definitely one of the ones to try out.
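To make the min_abundance idea concrete, here is a minimal Python sketch of k-mer abundance filtering. It is only an illustration under simplifying assumptions (forward strand only, no reverse complements, everything held in memory), not how Minia or any other assembler implements it; the function names `count_kmers` and `solid_kmers` are made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_abundance):
    """Return the k-mers seen at least min_abundance times."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


# Toy data: the third read carries a single-base difference (A -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
print(sorted(solid_kmers(reads, k=4, min_abundance=2)))
# prints ['ACGT', 'CGTA', 'GTAC']
```

With a threshold of 2, the k-mers that only occur in the divergent read (CGTT, GTTC) are dropped, so they never make it into a contig. Raising the threshold removes more errors but also removes true k-mers from low-coverage regions, which is why the contigs get shorter.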
If you have a small number of error-free reads, go for an overlap assembler such as CAP3. This won't work for a large number of reads because of memory constraints.
According to the first GAGE paper, SGA produces shorter but very accurate contigs. See http://genome.cshlp.org/content/early/2012/01/12/gr.131383.111.full.pdf
According to this paper in BMC Bioinformatics:
- For short-read libraries (e.g. Illumina MiSeq): CLC bio assembler (CLC Assembly Cell) (commercial, free 2-week trial)
- For Roche 454 read libraries: Newbler (Roche)
These assemblers tend to break reads and contigs at repeat boundaries and place repeated elements into separate contigs. Hence the contigs should be more conservative and of better quality (less likely to be misassembled).
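As a rough illustration of what "breaking at repeat boundaries" means, here is a small Python sketch that walks a toy de Bruijn graph and stops each contig wherever the path branches. It is an assumption-laden simplification (nodes are k-mers, forward strand only, isolated cycles ignored) and not the actual algorithm used by CLC or Newbler; the names `successors`, `predecessors` and `unitigs` are hypothetical.

```python
def successors(kmer, kmers):
    """k-mers in the set that overlap kmer by k-1 bases on the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """k-mers in the set that overlap kmer by k-1 bases on the left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def unitigs(kmers):
    """Return maximal non-branching paths; every branch ends a contig."""
    kmers = set(kmers)

    def starts_unitig(kmer):
        preds = predecessors(kmer, kmers)
        # A k-mer starts a contig unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        return len(preds) != 1 or len(successors(preds[0], kmers)) != 1

    contigs = []
    for kmer in sorted(kmers):
        if not starts_unitig(kmer):
            continue
        path, cur = kmer, kmer
        while True:
            succ = successors(cur, kmers)
            # Stop at a fork, or when the next k-mer has several ways in.
            if len(succ) != 1 or len(predecessors(succ[0], kmers)) != 1:
                break
            cur = succ[0]
            path += cur[-1]
        contigs.append(path)
    # Note: isolated cycles have no start node and are skipped here.
    return contigs


# 3-mers of the toy sequence ATGCGTGCA, in which TGC occurs twice.
kmers = {"ATG", "TGC", "GCG", "CGT", "GTG", "GCA"}
print(unitigs(kmers))
```

In this toy example the repeated 3-mer TGC ends up as its own contig and the sequence around it is split into separate pieces, instead of the assembler guessing which path through the repeat is correct. That is the trade-off described above: shorter contigs, but fewer chances of a misjoin.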