Question

What Is The Most Memory-Efficient De Novo Assembler?

8

13.0 years ago

toshnam ▴ 650

Hi all,

I need to assemble a HiSeq 2000 read set (558 million PE reads) on a Linux server with 16 cores and 128 GB of RAM.

I had thought SOAPdenovo was the most memory-efficient de novo assembler, but my server can't complete the assembly with it. I suspect there isn't enough RAM.

What is the most memory-efficient de novo assembler for a eukaryotic genome?

Thanks in advance.

assembly hiseq memory • 12k views

ADD COMMENT • link updated 13.0 years ago by Shaldenby ▴ 10 • written 13.0 years ago by toshnam ▴ 650

score 5 · Answer 1 · 2011-04-21

It's difficult to compare different assemblers in a fair, meaningful way.

However, Zhang et al. made a good attempt and recently published: "A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies". If you can cope with the horrible 3D plots in their paper, Figures 2 and 3 indicate memory usage for 9 assemblers.

score 4 · Answer 2 · 2011-04-21

4

13.0 years ago

Haibao Tang 3.0k

CLC de novo assembler. In their example, it needed 21 GB of RAM for a 38X human assembly.

ADD COMMENT • link 13.0 years ago by Haibao Tang 3.0k

0

I second this. I was surprised how low the memory usage was. Admittedly it was a small bacterial genome of 4Gb, but even that would make another approach use 10 GB of RAM, whereas the CLC assembler used around 1 GB or less.

ADD REPLY • link 12.6 years ago by Istvan Albert 100k

0

I second this. I've recently worked with it and was surprised just how low the memory usage was. Admittedly it was a small bacterial genome, but even that would make another approach use 10 GB of RAM, whereas the CLC assembler used around 1 GB or less.

ADD REPLY • link 12.6 years ago by Istvan Albert 100k

0

I am happy with CLC's performance. I also found that it gave me the best contig N50 compared to Velvet and SOAPdenovo. However, a big issue with the CLC de novo assembler is that it doesn't do scaffolding, so I am stuck with small contigs. For the genomes I work with, I would like to grow the contigs as large as possible.

ADD REPLY • link 12.6 years ago by Haibao Tang 3.0k

score 3 · Answer 3 · 2011-04-21

3

13.0 years ago

Jan Van Haarst ▴ 300

One of the most important steps in limiting your RAM consumption is filtering your input data.

Every k-mer that your dataset produces takes up space in the de Bruijn graph, so removing k-mers that only exist because of read errors will shrink your memory requirements tremendously.

In our lab we could cut the used memory in half by filtering the input data.

We have used Jellyfish with some scripts of our own, but other available packages are Quake, khmer and some tools from BGI.
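To illustrate why filtering helps, here is a minimal Python sketch (not the scripts mentioned above, and not how Jellyfish works internally) that counts k-mers in a handful of reads and reports what fraction of the distinct k-mers are rare. The function names, the toy reads and the threshold are all assumptions made for illustration. An in-memory `Counter` like this would not scale to a HiSeq run, which is why a dedicated k-mer counter is used in practice.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq that contains only A, C, G or T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def count_kmers(reads, k):
    """Count k-mer occurrences over all reads (toy, in-memory version)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def low_abundance_fraction(counts, min_count):
    """Fraction of distinct k-mers seen fewer than min_count times."""
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c < min_count)
    return rare / len(counts)


if __name__ == "__main__":
    # Five identical reads plus one read carrying a single substitution.
    reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
    counts = count_kmers(reads, k=4)
    print(len(counts), low_abundance_fraction(counts, min_count=2))
```

In this toy input a single substituted base adds four distinct k-mers that each occur once, so half of the distinct k-mers (each of which would be a node in the graph) come from one error.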

ADD COMMENT • link 13.0 years ago by Jan Van Haarst ▴ 300

0

How did you go about extracting reads from the Jellyfish output? Say you wanted to ignore k-mers with counts 1-6? I wrote something myself but it is very slow. It's a shame Jellyfish loses the read info for the k-mers.
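One possible approach, sketched in Python under stated assumptions: the k-mer counts have been exported to a plain two-column text file (k-mer, then count), the counts are for canonical k-mers, and the table fits in memory. None of this is confirmed by the thread; the function names are hypothetical, and a dict this size may itself be too large for a big genome. The reads are then scanned a second time, and any read containing a k-mer below the threshold is dropped.

```python
# Translation table for complementing A/C/G/T; other characters pass through.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def load_counts(lines):
    """Parse 'KMER COUNT' lines (assumed export format) into a dict."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            counts[parts[0].upper()] = int(parts[1])
    return counts


def read_is_solid(seq, counts, k, min_count):
    """True if every k-mer in the read was counted at least min_count times.

    K-mers missing from the table (including any containing N) count as 0,
    so such reads are rejected. Reads shorter than k have no k-mers and pass.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        if counts.get(canonical(seq[i:i + k]), 0) < min_count:
            return False
    return True


def filter_reads(reads, counts, k, min_count):
    """Keep only the reads whose k-mers all reach min_count."""
    return [r for r in reads if read_is_solid(r, counts, k, min_count)]


if __name__ == "__main__":
    counts = load_counts(["ACG 10", "CGA 3"])
    # min_count=7 ignores k-mers with counts 1-6, as in the question.
    print(filter_reads(["ACGT", "ACGA"], counts, k=3, min_count=7))
```

Dropping whole reads is the bluntest option; trimming a read at its first low-count k-mer would keep more data, at the cost of breaking read pairs of unequal length.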

ADD REPLY • link 12.7 years ago by Louis Letourneau ▴ 820

score 3 · Answer 4 · 2011-04-22

3

13.0 years ago

Benm ▴ 710

I think SOAPdenovo performs well for de novo assembly of short paired-end reads, and 128 GB may be sufficient to run it on 558 million PE reads. The most important thing, though, is to do error correction before you run the contig construction and scaffolding programs. After error correction, you should find that the assembly needs less memory. There is an error correction tool on the SOAPdenovo download website, or you can choose a third-party tool such as Euler-SR. If your reads are a mixed set, there is a recent reference you may follow: Leena Salmela, Correction of sequencing errors in a mixed set of reads. Bioinformatics, Vol. 26 no. 10 2010, pages 1284–1290.

ADD COMMENT • link 13.0 years ago by Benm ▴ 710

1

Do you mean "Correction tool for SOAPdenovo (Version 20090703)" on the homepage (http://soap.genomics.org.cn/index.html)? I'm going to run "KmerFreq", "Corrector", "merge_pair.pl", and "merge_pair_list.pl" as you suggested.

ADD REPLY • link 13.0 years ago by toshnam ▴ 650

0

I've stopped using SOAPCorrector because KmerFreq crashes on my FASTQ read files! I am currently trying Quake...

ADD REPLY • link 12.4 years ago by Frédéric Bigey ▴ 310

score 1 · Answer 5 · 2011-04-22

1

13.0 years ago

Kevin ▴ 640

http://kevin-gattaca.blogspot.com/2010/10/de-novo-assembly-of-large-genomes.html You did not mention your genome size, but Cortex might be the kind of software you are looking for if you do not have CLC bio.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.0 years ago by Kevin ▴ 640

score 1 · Answer 6 · 2011-09-19

1

12.6 years ago

Shaldenby ▴ 10

I think that the CLC assembler is pretty much the leader at the moment.

ADD COMMENT • link 12.6 years ago by Shaldenby ▴ 10

score 0 · Answer 7 · 2011-04-22

0

13.0 years ago

Rm 8.3k

What about IDBA: A Practical Iterative De Bruijn Graph De Novo Assembler?

http://code.google.com/p/hku-idba/

Has anyone worked with this tool?

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.0 years ago by Rm 8.3k