I am assembling bacterial genomes (~6 Mb) using 250 bp paired-end MiSeq data. I have tried a number of assemblers (idba_ud, mira, ray, SOAPdenovo and ABySS, to name a few), but am getting reasonably good results using good old velvet (~360 contigs, N50 = 40 kb). However, I have a question about how to set the velvetg parameter -max_coverage. Its value has a large effect on the resulting number of contigs and on the total number of bases in the assembly (i.e. the assembled genome size). Am I correct in thinking that many of these high-coverage nodes are errors (or at least error-prone, such as repeat elements) and should be excluded for a better assembly?
I estimate the coverage distribution (in R using plotrix) from the stats.txt file after running a preliminary assembly:
velvetg velvet_big_127 -cov_cutoff auto -exp_cov auto
It is then easy to calculate the weighted mean coverage to use for -exp_cov and to set a reasonable value for -cov_cutoff, but there is often a long tail in the distribution, meaning that there are a small number of nodes with very high coverage.
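For anyone wanting to reproduce the numbers, here is a rough sketch of the length-weighted calculation in Python rather than R. It assumes stats.txt is whitespace-delimited with a header row containing the columns lgth and short1_cov (the column names used in the Velvet manual's plotrix example), and it skips zero-length nodes and non-finite coverage values, which is my own choice. The function names are purely illustrative, and the weighted quantile is only a way of seeing how far the tail extends, not a rule for choosing -max_coverage.

```python
import math


def read_node_coverage(path, cov_column="short1_cov"):
    """Return a list of (length, coverage) pairs from a Velvet stats.txt file.

    Assumes a whitespace-delimited file whose header row contains the
    columns 'lgth' and `cov_column`.
    """
    nodes = []
    with open(path) as handle:
        header = handle.readline().split()
        length_idx = header.index("lgth")
        cov_idx = header.index(cov_column)
        for line in handle:
            fields = line.split()
            if len(fields) <= max(length_idx, cov_idx):
                continue  # skip blank or truncated lines
            length = int(fields[length_idx])
            cov = float(fields[cov_idx])
            # Assumption: ignore zero-length nodes and Inf/NaN coverage.
            if length > 0 and math.isfinite(cov):
                nodes.append((length, cov))
    return nodes


def weighted_mean(nodes):
    """Mean node coverage, weighting each node by its length."""
    total_length = sum(length for length, _ in nodes)
    return sum(length * cov for length, cov in nodes) / total_length


def weighted_quantile(nodes, q):
    """Smallest coverage c such that nodes with coverage <= c
    account for at least a fraction q of the total node length."""
    total_length = sum(length for length, _ in nodes)
    ordered = sorted(nodes, key=lambda node: node[1])
    running = 0
    for length, cov in ordered:
        running += length
        if running >= q * total_length:
            return cov
    return ordered[-1][1]
```

Calling read_node_coverage("velvet_big_127/stats.txt") and passing the result to weighted_mean gives the weighted mean coverage, while weighted_quantile with a q such as 0.99 shows where the high-coverage tail begins.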
Generally, what is a good way to determine a sensible value for -max_coverage?
Many thanks! Reuben