Question: Assembly Strategy For Large(R) Genome
7.0 years ago, burkhart.joshua (United States) wrote:

I was recently tasked by my lab to assemble a genome estimated at 850,000,000 bp (850 Mbp). This is a de novo assembly, and my only experience is with Velvet. I'm thinking about moving to SGA or SOAPdenovo but would first like to know what the community would recommend. I have some sense of Velvet's computational limits (RAM and execution time) but not of those of other assemblers. Any suggestions?

genome assembly velvet • 3.0k views

This review might get you started

Reply written 7.0 years ago by Irsan
7.0 years ago, Josh Herr (University of Nebraska) wrote:

The lab I work in is sequencing some very large plant genomes (mostly trees), ranging from 8.5 Gbp to 20 Gbp in length. It's not easy: a genome that large inevitably has a lot of duplications, repeats, and heterozygosity, and adding sequencing errors on top of that can make things pretty hellish. Good-quality sequencing data is absolutely key.

We've used Velvet on a large cluster, but we still have memory issues, and de novo runs can take weeks and still end up crashing. We have also tried both SOAPdenovo and CLC, but honestly I don't know if I would recommend one over the other at this point. We're far from solving the issue.

This problem is difficult (like another area I am working in, metagenomic shotgun sequencing assembly) and requires a ton of memory. We're just starting to use Titus Brown's diginorm script (his blog post with links, GitHub) to try to reduce the memory load.
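For readers unfamiliar with digital normalization, the idea can be sketched roughly as follows: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. This is a toy illustration and not the actual diginorm script. The function names, the default `k` and `cutoff` values, and the use of an exact in-memory `Counter` are all assumptions made for the example; a real tool would use a memory-efficient probabilistic counter and would also treat reverse-complement k-mers as equivalent.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer count is below `cutoff`.

    Hypothetical helper for illustration: counts are exact and held in
    memory, which would not scale to a real large-genome data set.
    """
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k, nothing to count
        # Median abundance of this read's k-mers among reads kept so far.
        if median(counts[km] for km in ks) < cutoff:
            counts.update(ks)
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
    for read in normalize(reads, k=4, cutoff=3):
        print(read)
```

Redundant high-coverage reads are dropped once their region is already well represented, while reads from low-coverage regions are kept, which is where the memory saving for the downstream assembler comes from.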

Don't know if any of this helps, but it's what we're up against.


Upvote for the diginorm script: we use a similar script in-house, and it greatly reduces the time needed for the assembly and vastly improves its quality. We mostly use Velvet for wheat and Brassica and have gotten some good results in the past.

Addendum: Trimming low-quality bases from reads (like, cut off the tail of the read if the quality drops below 20) gives you a faster and better analysis, as well.
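A minimal sketch of this kind of tail trimming, assuming Phred+33 encoded quality strings (as in current Illumina FASTQ files) and interpreting "the tail" as the run of bases at the 3' end whose quality is below the threshold. The function names are made up for this example, and dedicated trimming tools use more robust windowed or running-sum methods.

```python
def phred33(qual_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def trim_tail(seq, qual_string, min_q=20):
    """Strip 3' bases until one with quality >= min_q is reached.

    Hypothetical helper: returns the trimmed sequence and the matching
    trimmed quality string.
    """
    quals = phred33(qual_string)
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], qual_string[:end]


if __name__ == "__main__":
    # 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
    print(trim_tail("ACGTACGT", "IIIII###"))
```

Reads that end up empty or very short after trimming would normally be discarded before assembly.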

Reply written 7.0 years ago by Philipp Bayer

Yes, upvote for quality control on the sequences. Trimming and purging bad reads is a really important point.

Reply written 7.0 years ago by Josh Herr

7.0 years ago, rtliu (New Zealand) wrote:

We will be facing similar challenges in the near future. If I had time, I would like to try:

  1. SGA (String Graph Assembler): can assemble a human genome in 54 GB of memory (Genome Research paper).
  2. fermi: "Fermi is substantially influenced by SGA. It follows a similar workflow, including the idea of contrasting read sets. On the other hand, the internal implementation of fermi is distinct from that of SGA. Fermi is based on a novel data structure and uses different algorithms for almost every step. As to the end results, fermi has a similar performance to SGA for features shared between them, and is arguably easier to use" (author Heng Li).
  3. MSR-CA: combines the benefits of de Bruijn graph and Overlap-Layout-Consensus assembly approaches (manual). The software is under active development. It is the main assembler for the Pinus taeda genome (~24 Gbp).