I'm a bioinformatician with lots of NGS experience, but mostly with RNAseq and exomes. I'm looking at an upcoming project which involves the assembly of one or more human genomes. I have no experience in assembly, so what would people suggest I try?
I realise de novo assembly is non-trivial with such a large genome, but what about a reference guided assembly? Can the typical tools help with that, i.e. abyss, velvet or cortex?
Also, what kind of hardware requirements would I need? The biggest box I have access to has 24 cores and 128GB RAM.
Minia will give you an accurate contig assembly using very reasonable amounts of time (1 day) and memory (~6 GB). It is not a complete assembly pipeline (paired-end and mate-pair information is not used), but you can run a scaffolder (such as SSPACE) after it.
Cortex doesn't do whole-genome consensus assembly; if that's what you want, then Cortex is the wrong tool. If you want to find variants in this genome, or between it and other samples or the reference, then Cortex can do the job, and I think you're better off using Cortex than doing a WG assembly and then mapping to it. The exception is very large (>10kb) heterozygous structural variants, where Cortex has low power but an assembly might find things or hints (with a potentially high FDR, so you'd need to do a decent job of validating, studying and interpreting the results).
Abyss, or SGA or AllPaths-LG would be my tools of choice for standard WG human assembly. Or I might try fermi, but I'm embarrassed to admit I don't really know what kind of results fermi would give/how they would compare. Heng can comment more usefully on that.
The key is not to go out and look for the best assembler of all - nothing out there is the best at everything. Work out what you want to achieve with the assembly, and then go and assess the possible tools.
Fermi is not designed as a complete assembly package. It uses short-insert paired-end reads when assembling contigs, but it does not do scaffolding or use long-insert mate-pair reads. For contig assembly on the NA12878 data set, it is comparable to SGA and abyss in terms of N50 and misassembly rate (see preprint). With 128GB RAM, fermi should work with 45X coverage, at most 50X I would guess. SGA is more memory efficient. Another good assembler to try is SOAPdenovo2. When it is set up to use the sparse de Bruijn graph (a graph not keeping all k-mers), it can also assemble deep coverage in 128GB RAM.
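Since N50 comes up whenever assemblers are compared, here is a minimal sketch of how it is usually computed from a list of contig lengths. The function name and the example lengths are made up for illustration; this is the common definition (the contig length at which the running total, taken from the longest contig down, reaches half of the assembly size), not code from any of the assemblers mentioned.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the total
    assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths, purely for illustration.
example_contigs = [100, 80, 50, 40, 30]
print(n50(example_contigs))
```

Keep in mind that N50 says nothing about correctness, which is why it is quoted above together with the misassembly rate.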
Many thanks for these replies, they're very helpful.
The aim is to produce a genome for an individual or more - not so much discover anything specific - from a short read sequencing run. It'll be a couple of lanes of HiSeq, so ~20x.
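For anyone wanting to sanity-check a depth estimate like the ~20x above, a rough back-of-the-envelope calculation is: total sequenced bases divided by genome size. The sketch below assumes hypothetical numbers (300 million read pairs of 2x100 bp across the lanes and a 3.1 Gb genome); these are placeholders, not figures from this thread, so substitute whatever your sequencing facility actually reports.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical run: 300 million read pairs, 2 x 100 bp, 3.1 Gb genome.
read_pairs = 300_000_000
read_length = 100
genome_size = 3_100_000_000

depth = expected_coverage(2 * read_pairs, read_length, genome_size)
print(f"expected depth: {depth:.1f}x")
```

This is raw depth only; duplicates, low-quality reads and unmappable regions will lower the usable coverage.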
More of a comment than an answer, but I thought it could help you anyway.
The amount of computational power you need depends on the sequencing depth you are going to use. There are assemblers which will need much more RAM than you have.
I remember reading in the ALLPATHS-LG manual about the time and RAM they needed for a human genome. Additionally, some assemblers cannot handle paired-end data. My favorite assembler is CLC (unfortunately not free). If you are really getting serious, nothing is stopping you from trying several assemblers and using the one that produces the best output.
Additional tip: the most important thing in genome assembly (and maybe in all bioinformatics stuff) is preprocessing.
You have to trim for quality, correct sequencing errors (e.g. tools like Quake), check for contamination and check if the insert size provided by the sequencing lab is correct.
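To give an idea of what k-mer-based error correctors work with, here is a toy sketch of a k-mer count table: k-mers that overlap a sequencing error tend to be seen only once or twice, while genuine genomic k-mers are seen roughly as often as the coverage. This is only an illustration of that idea, not Quake's actual algorithm (which also uses base qualities); the function names, the toy reads and the cutoff are all assumptions of mine.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (forward strand only) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (likely errors)."""
    return {kmer for kmer, count in counts.items() if count < min_count}


# Toy data: three identical reads plus one read with a single base change.
toy_reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = kmer_counts(toy_reads, k=4)
print(sorted(weak_kmers(counts, min_count=2)))
```

On real data the cutoff is normally chosen from the k-mer count histogram rather than hard-coded, and a real tool would count canonical k-mers (both strands) with a memory-efficient structure instead of a Python dictionary.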
To the assembly itself:
There is no reason why you would do a de novo assembly. People have spent millions of dollars to come up with a human reference genome, so use it. A reference assembly might also be much less computationally intensive.
I am interested in any third-party evaluation of CLC on human data. I heard from friends two years ago that CLC was overstating its performance at that time. How about now? Also, Jared published in the SGA paper that for de novo assembly, trimming leads to a shorter N50. My experience is the same. The right strategy is to do quality-aware error correction as much as possible if the assembly algorithm itself cannot handle errors well. Most practical assemblers provide error correction tools (e.g. soapdenovo, sga, allpaths-lg, cortex and fermi; if I am right, none of them trim reads) or handle errors well (e.g. celera-assembler).
I agree a reference assembly is what I would like to do, but I don't know how. Abyss and fermi look like candidates. I don't have access to CLC bio, so that's not an option. Plus, I'd rather stick with open source :)
Before NGS, reference assembly was more often referred to as reference-guided assembly. The strategy usually required de novo assembly as a step and used a reference genome for orientation. Nowadays, by reference assembly, we typically mean mapping short reads to the reference genome and then running a SNP caller to call each base. Strictly speaking (at least in my view), this is not "assembly".
I misread your statement, my apologies. So you are looking to do de novo assembly? That is getting to be a pretty niche application in human genomes these days. Usually people use something like a hybrid mapping/assembly protocol if they aren't just doing standard mapping.
Unless you are working with long reads (PacBio/Nanopore), I don't think genome assembly is beneficial, and mapping would be preferable. But perhaps you have good reasons to go for assembly?
For whole genomes you can do short-read mapping the same as with exome data, in which case use whatever you prefer from your exome experience. There are of course situations where you may want to try de novo assembly to look for large or complex structural variations. I recently visited the BC Cancer Centre and, while they are obviously biased towards it, they have had a lot of success with ABySS. In their pipeline they mix short-read mapping and de novo assembly for analyses.
related: Genome assembly review papers