Question: What Assembler To Use For Eukaryotes?
10
8.7 years ago by
Ketil3.9k
Germany
Ketil3.9k wrote:

I'm in the process of assembling a medium size (roughly half a gigabase) eukaryote genome, but I'm running into difficulties. We have both 454 (Titanium) reads and Illumina (2x100) reads. It's easy to find suggestions of software to use, but usually they're not corroborated by actual results, and in particular it's hard to find information on what does not work. Thus, I would like to have more information, especially from independent groups (i.e. not developing the assembler or sequencing technology), including:

  1. size of genome
  2. type and amount (coverage) of sequencing
  3. assemblers tried, and results (N50, or any useful other metrics)

Our results so far:

  1. Genome size: 500-700Mb
  2. Illumina at 30x, 454 at 15x (shotgun only, no mate pairs yet)
  3. Newbler (13x 454): N50 of 4.5k; CLC (Illumina only): N50 of 6k; CLC (all data): N50 of 14k
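
Since the figures above are all N50 values, here is a minimal sketch of how N50 is commonly computed from a list of contig lengths. The function name `n50` and the example lengths are made up for illustration; individual assemblers may report N50 with slightly different conventions (for instance, after discarding contigs below a length cutoff).

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, not taken from any real assembly.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Because N50 depends on which contigs are included in the total, numbers from different assemblers are only comparable when the same minimum contig length is applied to each.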

Other results: Celera/CABOG hasn't terminated yet, SOAP runs quickly but has so far delivered terrible results, CLC on 454 data performs worse than Newbler.

ADD COMMENTlink modified 8.5 years ago by Yannick Wurm2.3k • written 8.7 years ago by Ketil3.9k
9
8.7 years ago by
Fiamh220
Boston, MA
Fiamh220 wrote:

Did you see Ewan's summary? Along with comments at Gattaca:

Velvet's algorithms in theory work for any size. However, the engineering aspects of Velvet, in particular memory consumption, mean it's unable to handle read sets above a certain size. This of course depends on how much real memory your machine has.

I know we have "routinely" (ie, for multiple strains) done Drosophila sized genomes (~120MB) on a 125GB machine.

I've heard of Velvet being used into the 200-300MB region, but rarely further. Memory size is not just about the size of the genome but also how error prone your reads are (though sheer size is important).
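
To illustrate why error-prone reads inflate memory use in a de Bruijn assembler, here is a small sketch that counts distinct k-mers in simulated reads with and without substitution errors. Everything in it is an assumption chosen for illustration: the random toy genome, the read length, the 1% error rate, and k = 21. It also ignores reverse complements, which real assemblers handle, and it is not how Velvet itself stores its graph.

```python
import random


def distinct_kmers(reads, k):
    """Count the distinct k-mers (graph nodes, roughly) in a set of reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)


def simulate_reads(genome, read_length, n_reads, error_rate, rng):
    """Sample reads uniformly from the genome, adding substitution errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(5000))  # toy genome
k = 21

clean = simulate_reads(genome, 100, 1000, 0.0, rng)
noisy = simulate_reads(genome, 100, 1000, 0.01, rng)

print("distinct k-mers, error-free reads:", distinct_kmers(clean, k))
print("distinct k-mers, 1% error reads:  ", distinct_kmers(noisy, k))
```

The error-free count is bounded by the genome length, whereas each sequencing error can add up to k new k-mers, so the noisy count grows with coverage rather than with genome size.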

Beyond this there are a variety of strategies:

"Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs, can be made using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published, well understood, from the BC genome centre).

Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions and thus provide an improved, more read-pair-aware graph. This can be iterated, and in at least some cases the Curtain approach gets close to what Velvet can produce alone (in the scenarios where Velvet can be run on a single memory machine to understand Curtain's performance).

SOAP de novo from the BGI is responsible for a number of the published assemblies (eg, Panda, YH) although like many assemblers, tuning it seems quite hard, and I would definitely be asking the BGI guys for advice.

A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but is not quite released yet.

In all the above cases I know of successes, but also quite a few failures, and untangling data quality/algorithm/choice of parameters/running bugs is really complex. So - whereas assemblies < 100MB are "routine", currently assemblies 100MB-500MB are "challenging" and > 500MB are theoretically doable, and have been done by specific groups, but I think still are at the leading edge of development and one should not be confident of success for "any particular genome".

Additional pointers to Cortex are on the linked weblog, and Contrail is worth a try for genomes of the size you describe.

ADD COMMENTlink written 8.7 years ago by Fiamh220

No, I didn't see it before, thanks for reposting it. I've read the Panda paper, and it looks like they had to spend a lot of effort to get decent results - the initial assembly, based on 39x regular PE coverage had a meagre 1.5K n50. My efforts with SOAPdenovo aren't very positive, but it is using a de Bruijn approach, so perhaps it will fare better if I filter the data heavily? (my opinion is that de Bruijn is very difficult to get right, especially with noisy data)

Although I appreciate the answer, most of this is what I did not want: general/vague success stories from the developers :-)

ADD REPLYlink written 8.7 years ago by Ketil3.9k

Ketil, I realize that, but from what I can tell you are in mostly uncharted territory, something reflected by Ewan's comments. Short of hunting down people struggling with the same problem, you'll probably have a hard time finding success stories. The SeqAnswers forum might be a better place to find researchers in a similar situation.

ADD REPLYlink written 8.7 years ago by Fiamh220

Just a small update: I finally got around to trying Velvet, but it got killed after running out of memory. This was on a fraction of the data, and sadly, I only have 144GB RAM available. So Velvet is out, I think.

ADD REPLYlink written 8.3 years ago by Ketil3.9k
2
8.7 years ago by
Jts1.2k
Jts1.2k wrote:

I ran into these slides [1] today by Alberto Policriti. He addresses some of your questions and compares a few of the assemblers (abyss, SoapDeNovo, CLC) on a 500Mbp genome. You might find the results useful.

As you're aware, there is a lot of parameter tweaking that can go into getting a good assembly, particularly with de Bruijn assemblers. If you choose to try abyss, be sure to sign up for the abyss mailing list. Shaun Jackman does a great job of supporting abyss and helping people get the best assembly out of the program. Disclaimer: I was one of the abyss authors.

[1] http://mi.caspur.it/workshop_NGS10/pres/Policriti.pdf

ADD COMMENTlink written 8.7 years ago by Jts1.2k
0
8.7 years ago by
Lhl730
United States
Lhl730 wrote:

Are you doing de-novo assembly?

ADD COMMENTlink written 8.7 years ago by Lhl730

please add this as comment then delete this answer

ADD REPLYlink written 8.7 years ago by Istvan Albert ♦♦ 80k

Yes, this is de novo, not resequencing.

ADD REPLYlink written 8.7 years ago by Ketil3.9k
0
8.6 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

I started using ABySS and IDBA for de novo assembly of some of the Drosophila species/strains. My initial experience suggests IDBA performs better than ABySS. It has the flexibility of checking k-mers over a range between a given minimum and maximum k value. Its run time and RAM requirements are also lower. (Correct me if I am wrong.)

ADD COMMENTlink written 8.6 years ago by Rm7.9k
0
8.2 years ago by
Yannick Wurm2.3k
Queen Mary University London
Yannick Wurm2.3k wrote:

We

  1. used SOAP for Illumina assembling and gapclosing
  2. then chopped Illumina reads into 300bp overlapping fragments
  3. added these chopped Illumina reads to 454 Newbler alongside real 454 reads

http://www.pnas.org/content/early/2011/01/24/1009690108.abstract
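
The chopping in step 2 could look roughly like the sketch below. This is not the code used for the paper. The function name `chop` and the 150 bp step between fragment starts are assumptions (the post gives only the 300 bp fragment length), and the sketch assumes the input sequences are longer than 300 bp, for example assembled Illumina sequence.

```python
def chop(sequence, fragment_length=300, step=150):
    """Yield overlapping fragments of `sequence`.

    `step` is the distance between consecutive fragment starts, so
    neighbouring fragments overlap by `fragment_length - step` bases.
    The default step of 150 is an assumption, not a value from the post.
    """
    if step <= 0 or step > fragment_length:
        raise ValueError("step must be between 1 and fragment_length")
    if len(sequence) <= fragment_length:
        # Too short to chop: pass the sequence through unchanged.
        yield sequence
        return
    last_start = len(sequence) - fragment_length
    for start in range(0, last_start + 1, step):
        yield sequence[start:start + fragment_length]
    if last_start % step != 0:
        # Add a final fragment so the end of the sequence is covered.
        yield sequence[last_start:]


# Toy 1000 bp sequence, purely for illustration.
contig = "ACGT" * 250
fragments = list(chop(contig))
print(len(fragments), {len(f) for f in fragments})
```

The resulting fragments would then be written out as pseudo-reads and given to Newbler together with the real 454 reads, as in step 3.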

ADD COMMENTlink written 8.2 years ago by Yannick Wurm2.3k