Question: Best de novo assembler for an insect genome?
4.3 years ago
Picasa550 wrote:


I have an insect genome to assemble (max size: 500 MB) with Illumina data consisting of paired-end and mate-pair libraries.

I'm thinking of using SOAPdenovo and SPAdes.

Can you recommend a better assembler for my data?

assembler • 2.2k views
modified 3.9 years ago by Chris Fields2.1k • written 4.3 years ago by Picasa550

There is no clear answer to that question, but you are advised to try several assemblers; I would suggest ABySS and SOAPdenovo. After that, as suggested in the answer by @harold.smith.tarheel, you can compare N50 and/or align your reads back to each assembly to see how it behaves. If many reads fail to align, your assembly is probably missing some regions. Since you have paired-end and mate-pair data, a low fraction of concordantly aligned reads suggests rearrangements (mis-joins) in your assembly. You can also compare against a related species to see how your assembly looks.

You can also use tools like REAPR (for de novo assemblies), misFinder (identifies mis-assemblies in an unbiased manner using a reference and paired-end reads), and QUAST.
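The read-alignment check described above can be sketched in a few lines of Python. This is only an illustration, assuming the reads have already been aligned to the assembly and written as SAM text; the function name `summarize_sam` is made up for this example. It reads the FLAG column (bit meanings as defined in the SAM specification) and reports the fraction of reads that mapped and the fraction of paired reads flagged as properly paired.

```python
from collections import Counter

# SAM FLAG bits, as defined in the SAM specification
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_sam(lines):
    """Return (fraction of reads mapped, fraction of paired reads in proper pairs).

    `lines` is any iterable of SAM text lines (hypothetical helper, not a library API).
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read only once
        counts["total"] += 1
        if not flag & UNMAPPED:
            counts["mapped"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
            if flag & PROPER_PAIR and not flag & UNMAPPED:
                counts["proper"] += 1
    mapped = counts["mapped"] / counts["total"] if counts["total"] else 0.0
    proper = counts["proper"] / counts["paired"] if counts["paired"] else 0.0
    return mapped, proper
```

For a real alignment you could pass an open file handle, for example `summarize_sam(open("reads_vs_assembly.sam"))` (the filename is hypothetical). `samtools flagstat` reports the same kind of numbers directly. Note that what counts as a "proper pair" depends on how the aligner was configured, which matters for mate-pair libraries with large inserts.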

modified 2.6 years ago • written 3.9 years ago by Medhat8.7k
4.3 years ago
United States
harold.smith.tarheel4.6k wrote:

"Best assembler" is in the eye of the beholder. What are your requirements? Longest NG50? Most comprehensive gene coverage? Accurate resolution of heterozygosity? Best long range connectivity? Most reads remapping to your assembly?

There is no single best assembler, or single best metric for determining the best assembly. I recommend the Assemblathon 2 paper for its discussion of assembly evaluation, as well as challenges posed by heterozygosity, repetitive sequences, etc.
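To make the N50 vs. NG50 distinction concrete, here is a minimal Python sketch. The function name `nx50` and the toy contig lengths are made up for illustration; the only difference between the two metrics is whether "half" refers to the assembly size (N50) or to an estimated genome size (NG50).

```python
def nx50(lengths, total=None):
    """Length L such that contigs of length >= L sum to at least half of `total`.

    total=None                 -> half of the assembly size is used (N50)
    total=estimated genome size -> the result is NG50
    Returns None if the contigs never add up to half of `total`.
    """
    target = (sum(lengths) if total is None else total) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


contigs = [80, 70, 50, 40, 30, 20]   # toy contig lengths, total 290
print(nx50(contigs))                 # N50  -> 70
print(nx50(contigs, total=400))      # NG50 -> 50
print(nx50(contigs, total=1000))     # NG50 -> None (assembly covers < half the genome)
```

Because N50 is relative to the assembly's own size, an assembler that discards short or difficult contigs can raise its N50 without improving the assembly; NG50 uses the same denominator for every assembly, which makes comparisons between assemblers fairer.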

written 4.3 years ago by harold.smith.tarheel4.6k

I know that paper, but its tests were done on large eukaryotes, whereas I am trying to assemble an insect genome.

By "best" I mostly mean the best N50.

modified 4.3 years ago • written 4.3 years ago by Picasa550

Those vertebrate genomes were only 2X-3X larger than your insect (1.0-1.6 GB vs 500MB), so the sizes are comparable. And no single assembler gives consistently best NG50 across all data sets. That metric is strongly dependent upon the degree of heterozygosity and repetitive DNA, which varies by genome.

modified 4.3 years ago • written 4.3 years ago by harold.smith.tarheel4.6k
3.9 years ago
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

My suggestion would be to do some preliminary QC on the sequence data first, which may help dictate which assemblers you may want to look into. Run a k-mer analysis to determine the level of actual coverage and complexity of the data (you could use Jellyfish, khmer, or any of a whole slew of tools to generate this data). We also run preQC to give a more complete assessment.
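As a rough illustration of what such a k-mer analysis computes, here is a small pure-Python sketch. It is not how Jellyfish or khmer work internally, and it would be far too slow for a real 500 MB genome; all function names and the `min_depth` cutoff are assumptions made for this example. It counts canonical k-mers, builds the multiplicity histogram, picks the main coverage peak, and derives a crude genome size estimate as (total k-mers) / (peak depth).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k):
    """Count canonical k-mers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _BASES:  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def coverage_peak(hist, min_depth=2):
    """Most frequent multiplicity at or above min_depth.

    Assumption: k-mers below min_depth are mostly sequencing errors.
    """
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    return max(candidates, key=candidates.get) if candidates else None


def estimate_genome_size(hist, peak, min_depth=2):
    """Crude estimate: total k-mer occurrences (errors excluded) / peak depth."""
    total = sum(d * n for d, n in hist.items() if d >= min_depth)
    return total / peak
```

In a real histogram, a second peak at roughly half the depth of the main peak is a common sign of heterozygosity, and a long tail at high multiplicity points to repeats; both are the kind of "complexity" that influences the choice of assembler.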

This, plus what library types you have, normally helps dictate which assemblers may work best. If you have overlapping shotgun libraries and a genome with low heterozygosity, ALLPATHS-LG or DISCOVAR are great (with the latter you would need to scaffold with a separate tool). Which of the two to use depends on the read length of your sequence data.

If the het. rate is pretty high you could give Platanus a go; we've had fairly reasonable luck with it on a few troublesome genomes. You can also use SOAPdenovo, though I believe it's now deprecated in favor of MEGAHIT (we haven't tried this one yet).

written 3.9 years ago by Chris Fields2.1k
3.9 years ago
Boston, MA
Lina F180 wrote:

Here is a recent paper discussing using DISCOVAR for insect assembly:

Might be helpful

written 3.9 years ago by Lina F180
