All, I am curious whether anyone out there has a method for assessing the quality and accuracy of de novo genome assemblies. I am currently running in silico simulations of de novo genome assembly from a previously sequenced genome to determine the best assembly parameters (k-mer size, coverage cutoff, etc.) and the optimal dataset (mate-pair library size, coverage, etc.). The ultimate goal is to use these parameters to assemble the genome of a related species de novo.
However, the difficulty is that after simulating the data and making a de novo assembly, I don't know of any statistics or methods for comparing the assembled contigs back to the original sequence they were simulated from. This requires two steps: (1) align the assembled contigs to the reference genome, and (2) assess the fit.
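As one possible way to make step (2) concrete, here is a minimal Python sketch that computes the fraction of the reference covered by at least one contig alignment. It assumes step (1) has already produced alignment coordinates on the reference as 1-based inclusive (start, end) pairs; the function names `merge_intervals` and `reference_fraction_covered` are made up for illustration and do not come from any assembly tool.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    # Normalise reverse-strand hits (start > end) before sorting.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def reference_fraction_covered(intervals, reference_length):
    """Fraction of reference bases covered by at least one alignment."""
    covered = sum(e - s + 1 for s, e in merge_intervals(intervals))
    return covered / reference_length


if __name__ == "__main__":
    # Hypothetical alignments on a 1000 bp reference.
    hits = [(1, 100), (50, 150), (300, 200)]
    print(merge_intervals(hits))
    print(reference_fraction_covered(hits, 1000))
```

Merging before summing means a base covered by several overlapping contigs is counted once, so the value stays between 0 and 1 as long as the coordinates lie within the reference.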
People often optimize N50, assembly size, contig number and other length-based measurements, but this only rewards bigger and bigger contigs and says little about whether those contigs are accurate. I have been using BLAST to compare the contigs to the reference and asking how well they fit, how long the alignments are, and how many mis-assembled contigs there are. If anyone has ideas or methods for assessing the accuracy (or the overall similarity of an assembly and a genome), I would be grateful to hear about them.
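To illustrate the kind of comparison described above, here is a rough Python sketch that computes N50 and summarises BLAST hits per contig. It assumes BLAST was run with the default tabular output (`-outfmt 6`, columns qseqid through bitscore) and that contig lengths are supplied separately. The function names, the report layout and the `min_fraction=0.9` cutoff are arbitrary choices for this example, and flagging a contig as "suspect" is only a heuristic, not a definitive mis-assembly call.

```python
import csv

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def best_hits(blast_lines):
    """Keep the highest-bitscore hit for each query contig."""
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(BLAST_FIELDS, row))
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


def summarize(blast_lines, contig_lengths, min_fraction=0.9):
    """Per contig: fraction spanned by its best hit, identity, and a heuristic flag."""
    hits = best_hits(blast_lines)
    report = {}
    for contig, length in contig_lengths.items():
        hit = hits.get(contig)
        if hit is None:
            report[contig] = {"aligned_fraction": 0.0, "identity": None, "suspect": True}
            continue
        span = abs(int(hit["qend"]) - int(hit["qstart"])) + 1
        fraction = span / length
        report[contig] = {"aligned_fraction": fraction,
                          "identity": float(hit["pident"]),
                          "suspect": fraction < min_fraction}
    return report
```

`blast_lines` can be an open file handle or a list of lines. A contig whose best hit spans only part of its length may be chimeric, but it may also just overlap a repeat or a region where the simulated reads diverged, so flagged contigs are worth inspecting by hand rather than counting directly as errors.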