Question: Tool Recommendations For Human Genome Assembly
Chris Cole (Scotland), 4.1 years ago, wrote:

Hi all,

I'm a bioinformatician with lots of NGS experience, but mostly with RNAseq and exomes. I'm looking at an upcoming project which involves the assembly of one or more human genomes. I have no experience in assembly, so what would people suggest I try?

I realise de novo assembly is non-trivial with such a large genome, but what about a reference-guided assembly? Can the typical tools help with that, e.g. abyss, velvet or cortex?

Also, what kind of hardware requirements would I need? The biggest box I have access to has 24 cores and 128GB RAM.

Any suggestions gratefully received. Cheers,

Chris

genome assembly tools human • 3.3k views

related: Genome assembly review papers

Reply written 4.1 years ago by Pierre Lindenbaum
Rayan Chikhi (France, Lille, CNRS), 4.1 years ago, wrote:

Minia will give you an accurate contig assembly using a very reasonable amount of time (1 day) and memory (~6 GB). It is not a complete assembly pipeline (paired-end and mate-pair information is not used), but you can run a scaffolder (such as SSPACE) after it.

zam.iqbal.genome (United Kingdom), 4.1 years ago, wrote:

Looking at the options you mention:

  • velvet won't handle a whole human genome

  • Cortex doesn't do whole-genome consensus assembly; if that's what you want, then Cortex is the wrong tool. If you want to find variants in this genome, or between it and other samples or the reference, then Cortex can do the job, and I think you're better off using Cortex than doing a WG assembly and then mapping to it. The exception is very large (>10kb) heterozygous structural variants, where Cortex has low power but an assembly might find things/hints (with a potentially high FDR, so you'd need to do a decent job of validating/studying/interpreting the results).

  • Abyss, SGA or AllPaths-LG would be my tools of choice for standard WG human assembly. I might also try fermi, but I'm embarrassed to admit I don't really know what kind of results fermi would give or how they would compare. Heng can comment more usefully on that.

The key is not to go out and look for the best assembler of all - nothing out there is the best at everything. Work out what you want to achieve with the assembly, and then go and assess the possible tools.


Fermi is not designed as a complete assembly package. It uses short-insert paired-end reads when assembling contigs, but it does not do scaffolding or use long-insert mate-pair reads. For contig assembly on the NA12878 data set, it is comparable to SGA and abyss in terms of N50 and misassembly rate (see preprint). With 128GB RAM, fermi should work with 45X coverage, at most 50X I would guess. SGA is more memory efficient. Another good assembler to try is SOAPdenovo2. When it is set up to use the sparse de Bruijn graph (a graph not keeping all k-mers), it can also assemble deep coverage in 128GB RAM.

Reply written 4.1 years ago by lh3
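Since N50 is the figure being compared above, here is a minimal Python sketch of how it is usually computed from a list of contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the example lengths are made up for illustration; individual assemblers and evaluation tools may apply extra rules (e.g. a minimum contig length) before computing it.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running to total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```

Note that N50 says nothing about correctness, which is why the misassembly rate is quoted alongside it.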

Thanks very much Heng.

Reply written 4.1 years ago by zam.iqbal.genome

Oops. AllPaths-LG won't fit into 128GB of RAM (at least, it needed 512GB of RAM in their paper). SGA, Abyss and Fermi would all fit in 128GB of RAM.

Reply written 4.1 years ago by zam.iqbal.genome

Many thanks for these replies, they're very helpful.

The aim is to produce a genome for an individual or more - not so much discover anything specific - from a short read sequencing run. It'll be a couple of lanes of HiSeq, so ~20x.

Reply written 4.1 years ago by Chris Cole
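For anyone wanting to sanity-check a depth figure like the ~20x above, a rough Python sketch of the usual arithmetic follows: coverage is total sequenced bases divided by genome size. The read count and read length in the example are made-up numbers chosen to land near 20x, not real HiSeq yields, and the ~3.1 Gbp genome size is an approximation.

```python
import math

HUMAN_GENOME_SIZE = 3_100_000_000  # approximate haploid size in bp (assumption)


def estimated_coverage(num_reads, read_length, genome_size=HUMAN_GENOME_SIZE):
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size=HUMAN_GENOME_SIZE):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical run: 600 million reads of 100 bp each.
depth = estimated_coverage(600_000_000, 100)
print(f"estimated depth: {depth:.1f}x")
```

This is raw depth before any filtering, duplicate removal or unmapped reads, so the usable depth will be somewhat lower.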
Fabian Bull (German), 4.1 years ago, wrote:

More of a comment than an answer, but I thought it could help you anyway.

The amount of computational power you need depends on the sequencing depth you are going to use. Some assemblers will need much more RAM than you have; I remember reading in the AllPaths-LG manual about the time and RAM they needed for a human genome. Additionally, some assemblers cannot handle paired-end data. My favorite assembler is CLC (unfortunately not free). If you are really getting serious, nothing is stopping you from trying several assemblers and using the one that produces the best output.

Additional tip: the most important thing in genome assembly (and maybe in all of bioinformatics) is preprocessing. You have to trim for quality, correct sequencing errors (e.g. with tools like Quake), check for contamination, and check whether the insert size provided by the sequencing lab is correct.
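To make "trim for quality" concrete, here is a deliberately naive Python sketch that clips low-quality bases from the 3' end of a single read. It assumes Phred+33 quality encoding and a cutoff of Q20; both are assumptions (older Illumina pipelines used Phred+64), and the function name is made up. Real trimmers use more robust windowed or running-sum schemes, and whether trimming helps de novo assembly at all is questioned in the replies to this answer.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Clip bases from the 3' end while their Phred score is below min_q.

    seq   -- read sequence
    qual  -- FASTQ quality string of the same length
    offset -- ASCII offset of the quality encoding (33 assumed here)
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk backwards until a base meets the quality cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
```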

On the assembly itself: there is no reason why you should do a de novo assembly. People spent millions of dollars to come up with a human genome, so use it. A reference assembly might also be much less computationally intensive.


I am interested in any third-party evaluation of CLC on human data. I heard from my friends two years ago that CLC was overstating their performance at that time. How about now? Also, Jared published in the SGA paper that for de novo assembly, trimming leads to shorter N50. My experience is the same. The right strategy is to do quality-aware error correction as much as possible if the assembly algorithm itself cannot handle errors well. Most practical assemblers provide error correction tools (e.g. soapdenovo, sga, allpaths-lg, cortex and fermi; if I am right, none of them trim reads) or handle errors well (e.g. celera-assembler).

Reply written 4.1 years ago by lh3

IMHO, the advantage of SGA is a desirable scaling behavior and not better assembly performance.

I have never compared assemblers in much detail (my boss did it in his thesis). The only thing I can say is that CLC produces by far the best N50 values.

Reply written 4.1 years ago by Fabian Bull

According to assemblathon1, sga is one of the best assemblers overall.

Reply written 4.1 years ago by lh3

Thanks.

I agree a reference assembly is what I would like to do, but I don't know how. Abyss and fermi look like candidates. I don't have access to CLC bio, so that's not an option. Plus, I'd rather stick with open source :)

Reply written 4.1 years ago by Chris Cole

Before NGS, reference assembly more often meant reference-guided assembly. That strategy usually required de novo assembly as a step and used a reference genome for orientation. Nowadays, by reference assembly, we typically mean mapping short reads to the reference genome and then running a SNP caller to call each base. Strictly speaking (at least in my view), this is not "assembly".

Reply written 4.1 years ago by lh3
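The map-then-call approach described above can be sketched as a short list of shell commands, built here in Python so they can be inspected before anything is run. The tools (bwa, samtools, bcftools) are common present-day choices and are not named in the thread; the file names, sample name and helper function are hypothetical, and the thread count simply mirrors the 24-core box mentioned in the question.

```python
import shlex


def reference_calling_commands(reference, reads1, reads2, sample, threads=24):
    """Build shell commands for mapping paired-end reads and calling variants.

    All paths are placeholders supplied by the caller; nothing is executed here.
    """
    ref, r1, r2 = (shlex.quote(p) for p in (reference, reads1, reads2))
    bam = shlex.quote(f"{sample}.sorted.bam")
    vcf = shlex.quote(f"{sample}.vcf.gz")
    return [
        # Index the reference once so bwa can map against it.
        f"bwa index {ref}",
        # Map the read pairs and write a coordinate-sorted BAM.
        f"bwa mem -t {threads} {ref} {r1} {r2} | samtools sort -@ {threads} -o {bam} -",
        # Index the BAM for random access.
        f"samtools index {bam}",
        # Pile up reads against the reference and call variant sites.
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Oz -o {vcf}",
    ]


# Hypothetical inputs; print the commands rather than running them.
for command in reference_calling_commands(
    "reference.fa", "sample_1.fq.gz", "sample_2.fq.gz", "sample"
):
    print(command)
```

To actually run a command, something like `subprocess.run(command, shell=True, check=True)` would do, assuming the tools are installed and on the PATH.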

Oh, is that it? I was considering doing that, but thought it too simplistic...

I agree it isn't an assembly, but should suit my needs in this case.

Reply written 4.1 years ago by Chris Cole
Dan Gaston (Canada), 4.1 years ago, wrote:

For whole genomes you can do short-read mapping the same as with exome data, in which case use whatever you prefer from your exome experience. There are of course situations where you may want to try de novo assembly to look for large or complex structural variations. I recently visited the BC Cancer Centre and, while they are obviously biased towards it, they have had a lot of success with ABySS. In their pipeline they mix short-read mapping and de novo assembly for analyses.
