I was recently tasked by my lab to assemble a genome estimated at 850,000,000 bp. This is de novo and I've only had experience with Velvet. I'm thinking about moving to SGA or SOAPdenovo but would first like to know what the community would recommend. I have some sense of Velvet's computational limits (RAM / Execution Time) but not of other assemblers. Any suggestions?
The lab I work in is sequencing some very large plant genomes (mostly trees), all somewhere between 8.5 Gbp and 20 Gbp in length. It's not easy: a genome that large inevitably has a lot of duplications, repeats, and heterozygosity, and adding sequencing errors on top can make things pretty hellish. Good-quality sequencing data is absolutely key.
We've used Velvet on a large cluster, but we still hit memory issues, and de novo runs can take weeks and still end up crashing. We have also tried both SOAPdenovo and CLC, but honestly I don't know if I would recommend one over the other at this point. We're far from solving the issue.
This problem is difficult (like another area I work in, metagenomic shotgun sequencing assembly) and requires a ton of memory. We're just starting to use Titus Brown's diginorm (digital normalization) script (his blog post with links, github) to try to reduce the memory load.
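For anyone unfamiliar with the idea, here is a minimal sketch of digital normalization as I understand it: stream through the reads and keep a read only if the median count of its k-mers, among the reads kept so far, is still below a coverage cutoff. This is not the actual khmer code. The function names (`kmers`, `normalize`) and the default `k` and `cutoff` values are my own illustrative choices, and it uses an exact in-memory `Counter` and ignores reverse complements, whereas the real tool uses a memory-efficient probabilistic counter.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize(reads, k=20, cutoff=20):
    """Return the reads that survive a simple median-based normalization.

    A read is kept only if the median count of its k-mers, counted over
    the reads kept so far, is below `cutoff`. Values are illustrative.
    """
    counts = Counter()  # exact counts; fine for a toy, too big for real data
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to coverage
    return kept
```

The point is that redundant high-coverage reads get dropped before assembly, so the assembler's graph (and its memory footprint) shrinks, while low-coverage regions are left alone.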
Don't know if any of this helps, but it's what we're up against.
We will be facing similar challenges in the near future. If I had time, I would like to try:
- SGA - String Graph Assembler; reported to assemble a human genome in 54 GB of memory. Genome Research Paper
- fermi - "Fermi is substantially influenced by SGA. It follows a similar workflow, including the idea of contrasting read sets. On the other hand, the internal implementation of fermi is distinct from that of SGA. Fermi is based on a novel data structure and uses different algorithms for almost every step. As to the end results, fermi has a similar performance to SGA for features shared between them, and is arguably easier to use" - Author Heng Li
- MSR-CA - The MSR-CA assembler combines the benefits of de Bruijn graph and Overlap-Layout-Consensus assembly approaches. Manual. The software is under active development. It is the main assembler for the Pinus taeda genome (~24 Gbp).