Velvet's algorithms in theory work for genomes of any size. In practice, however, Velvet's engineering constraints, in particular its memory consumption, mean it cannot handle read sets beyond a certain size. Where that limit falls depends, of course, on how much physical memory your machine has.
I know we have "routinely" (i.e., for multiple strains) assembled Drosophila-sized genomes (~120MB) on a 125GB machine.
I've heard of Velvet being used in the 200-300MB range, but rarely beyond. Memory consumption is driven not just by the size of the genome but also by how error-prone your reads are (though sheer size is important), since sequencing errors introduce spurious k-mers that inflate the graph.
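To make the error-rate point concrete, here is a toy Python sketch (not Velvet's actual data structures; the genome size, coverage, read length and error rate are just illustrative assumptions). A de Bruijn-style assembler has to hold roughly one node per distinct k-mer, and every substitution error can create up to k k-mers that do not exist in the genome:

```python
# Toy illustration only (not Velvet's data structures): count how many
# distinct k-mers a de Bruijn graph would have to store, with and without
# substitution errors in the reads.

import random

K = 31
BASES = "ACGT"

def kmers(seq, k=K):
    """Yield all k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def simulate(genome_size=100_000, coverage=30, read_len=100, error_rate=0.01):
    """Return (#distinct k-mers from clean reads, #distinct from noisy reads)."""
    genome = "".join(random.choice(BASES) for _ in range(genome_size))
    n_reads = genome_size * coverage // read_len

    clean, noisy = set(), set()
    for _ in range(n_reads):
        start = random.randrange(genome_size - read_len)
        read = genome[start:start + read_len]
        clean.update(kmers(read))
        # Apply random substitutions at the given per-base error rate.
        erroneous = "".join(
            random.choice(BASES) if random.random() < error_rate else base
            for base in read
        )
        noisy.update(kmers(erroneous))
    return len(clean), len(noisy)

if __name__ == "__main__":
    n_clean, n_noisy = simulate()
    print(f"distinct k-mers, error-free reads: {n_clean:,}")
    print(f"distinct k-mers, ~1% error reads:  {n_noisy:,}")
```

With roughly 1% errors on 100bp reads, the noisy k-mer set usually comes out several times larger than the error-free one, which is why error rate matters for memory almost as much as genome size does.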
Beyond this, there are a variety of strategies:
"Raw" de Bruijn graphs, without tremendously aggressive use of read pairs, can be built using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published, well understood, from the BC Genome Sciences Centre).
Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions, and thus provide an improved, more read-pair-aware graph. This can be iterated, and in at least some cases the Curtain approach gets close to what Velvet can produce alone (in the scenarios where Velvet can be run on a single large-memory machine, so that Curtain's performance can be compared against it).
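To be clear, the following is only a minimal sketch of the general partition-then-assemble idea, not Curtain's actual algorithm or interface. The inputs (seed_contigs.fa, reads.fa), the binning rule (assign each read to the seed contig most of its k-mers hit) and the run_velvet wrapper around the standard velveth/velvetg commands are all assumptions for illustration:

```python
# Rough sketch of the partition-then-assemble idea (NOT Curtain's real
# algorithm or interface). Reads are binned by the seed contig their k-mers
# mostly map to, then each bin is assembled independently.

import os
import subprocess
from collections import Counter, defaultdict

K = 31

def read_fasta(path):
    """Minimal FASTA reader yielding (header, sequence) pairs."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def build_kmer_index(seed_contigs_fasta):
    """Map each k-mer of the seed contigs to the contig it came from."""
    index = {}
    for name, seq in read_fasta(seed_contigs_fasta):
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]] = name
    return index

def assign_read(seq, index):
    """Vote: the seed contig hit by most of the read's k-mers, or None."""
    votes = Counter()
    for i in range(len(seq) - K + 1):
        hit = index.get(seq[i:i + K])
        if hit is not None:
            votes[hit] += 1
    return votes.most_common(1)[0][0] if votes else None

def partition_reads(reads_fasta, index, out_dir):
    """Write one FASTA file per partition; return the list of files."""
    os.makedirs(out_dir, exist_ok=True)
    bins = defaultdict(list)
    for name, seq in read_fasta(reads_fasta):
        bins[assign_read(seq, index) or "unassigned"].append((name, seq))
    files = []
    for label, reads in bins.items():
        path = os.path.join(out_dir, f"partition_{label}.fa")
        with open(path, "w") as out:
            for name, seq in reads:
                out.write(f">{name}\n{seq}\n")
        files.append(path)
    return files

def run_velvet(partition_fasta, work_dir):
    """Hypothetical wrapper: shell out to velveth/velvetg on one partition."""
    subprocess.run(["velveth", work_dir, str(K), "-fasta", "-short",
                    partition_fasta], check=True)
    subprocess.run(["velvetg", work_dir], check=True)
    return os.path.join(work_dir, "contigs.fa")

if __name__ == "__main__":
    # seed_contigs.fa and reads.fa are placeholder inputs for illustration.
    index = build_kmer_index("seed_contigs.fa")
    for part in partition_reads("reads.fa", index, "partitions"):
        run_velvet(part, part + "_velvet")
```

A real tool would also keep read pairs in the same partition and merge or iterate on the per-partition assemblies, which this sketch leaves out.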
SOAPdenovo from the BGI is responsible for a number of the published assemblies (e.g., Panda, YH), although, like many assemblers, tuning it seems quite hard, and I would definitely ask the BGI guys for advice.
A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.
In all the above cases I know of successes, but also quite a few failures, and untangling data quality/algorithm/choice of parameters/running bugs is really complex. So, whereas assemblies under 100MB are "routine", assemblies of 100MB-500MB are currently "challenging", and assemblies over 500MB are theoretically doable, and have been done by specific groups, but I think they are still at the leading edge of development, and one should not be confident of success for "any particular genome".
No, I didn't see it before, thanks for reposting it. I've read the Panda paper, and it looks like they had to spend a lot of effort to get decent results: the initial assembly, based on 39x regular PE coverage, had a meagre 1.5K N50. My efforts with SOAPdenovo aren't very positive so far, but it does use a de Bruijn approach, so perhaps it will fare better if I filter the data heavily? (My opinion is that de Bruijn is very difficult to get right, especially with noisy data.)
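For what it's worth, one common way to "filter heavily" before a de Bruijn assembler is to drop (or trim) reads containing low-abundance k-mers, since k-mers seen only once or twice across a whole data set are overwhelmingly sequencing errors. The sketch below is generic, not SOAPdenovo's own error-correction step, and the k-mer size, threshold and file names are just assumptions:

```python
# Minimal sketch of k-mer-frequency read filtering for noisy data (not
# SOAPdenovo's error-correction module): discard reads containing any k-mer
# seen fewer than MIN_COUNT times in the whole data set.

from collections import Counter

K = 21          # assumed k-mer size
MIN_COUNT = 3   # assumed abundance threshold; tune to coverage

def read_fastq(path):
    """Minimal FASTQ reader yielding (header, sequence, quality) records."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            fh.readline()                 # '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def kmers(seq, k=K):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def filter_reads(in_fastq, out_fastq):
    # Pass 1: count k-mer abundances across the whole data set.
    counts = Counter()
    for _, seq, _ in read_fastq(in_fastq):
        counts.update(kmers(seq))

    # Pass 2: keep only reads whose every k-mer is solidly supported.
    kept = dropped = 0
    with open(out_fastq, "w") as out:
        for header, seq, qual in read_fastq(in_fastq):
            if all(counts[km] >= MIN_COUNT for km in kmers(seq)):
                out.write(f"{header}\n{seq}\n+\n{qual}\n")
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} reads, dropped {dropped}")

if __name__ == "__main__":
    # File names are placeholders for illustration.
    filter_reads("reads.fastq", "reads.filtered.fastq")
```

In practice the counting would be done with a dedicated k-mer counter rather than an in-memory dictionary (the Counter above will not fit a large data set in RAM), but the principle is the same.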
Although I appreciate the answer, most of this is exactly what I did not want: general/vague success stories from the developers :-)
Ketil, I realize that, but from what I can tell you are in mostly uncharted territory, something reflected in Ewan's comments. Short of hunting down people struggling with the same problem, you'll probably have a hard time finding success stories. The SeqAnswers forum might be a better place to find researchers in a similar situation.
Just a small update: I finally got around to trying Velvet, but it got killed after running out of memory. This was on only a fraction of the data, and sadly, I only have 144GB of RAM available. So Velvet is out, I think.