I am looking into assembling 454 reads from a metagenomic sample into contigs for protein prediction (homology search, de novo gene finding).
In the papers and data sets I have looked at so far, people mostly focus on phylotyping and thus rely on the raw reads. In cases where they do assemble the reads, the assemblies are mediocre (a huge percentage of singletons, only a few contigs >2000 bp); the N50 barely exceeds the average read length.
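For anyone comparing assemblies the same way, this is the N50 statistic I mean: the contig length L such that contigs of length >= L cover at least half of the total assembly. A minimal sketch (the contig lengths are made up for illustration):

```python
# Minimal sketch: compute N50 from a list of contig lengths.

def n50(lengths):
    """Length L such that contigs >= L cover at least half the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

contigs = [5000, 3000, 2000, 1000, 500, 500]  # hypothetical contig lengths
print(n50(contigs))  # → 3000
```

So an N50 near the average read length means most "contigs" are barely more than single reads.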
Now my question is: why are the assemblies so bad? I assume that the coverage provided by a single 454 run (~1M reads) is too low, and that together with 454's error model, Newbler has a hard time finding enough overlaps. I also tried the MIRA assembler on one data set, with more or less the same result; Velvet didn't do any better on these reads either.
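To put the low-coverage intuition into numbers, here is a back-of-the-envelope Lander-Waterman estimate (C = N * L / G). The read length and the effective community size are assumptions I made up for illustration, not measured values from my data set:

```python
# Back-of-the-envelope coverage estimate: C = N * L / G (Lander-Waterman).
# All numbers are illustrative assumptions.

reads = 1_000_000        # roughly one 454 run
read_len = 400           # assumed average read length in bp
community_size = 500e6   # assumed effective metagenome size in bp
                         # (e.g. on the order of 100 bacterial genomes)

coverage = reads * read_len / community_size
print(f"{coverage:.2f}x")  # → 0.80x
```

At well under 1x average coverage of the community, most genome positions are sampled by a single read, so an overlap-based assembler simply has nothing to join, regardless of which assembler is used.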
So, does anybody have a suggestion on how to improve the assembly? Different software? More runs and thus higher coverage?
I am grateful for your suggestions. Thanks!