Question

A Question About Hybrid Assembly

13.2 years ago

Lhl ▴ 760

Hi there,

I have sequence reads produced by both 454 and Illumina technologies. I assembled the 454 reads using Newbler and the Illumina reads using Velvet.

Now I want to combine all the data and do a more complete assembly.

Does anyone know how to do a hybrid assembly combining both data types, starting either from the reads or from the contigs?

Thanks a lot!

PS: I have been thinking of combining the 454 and Illumina contigs and removing the redundancy based on identity, or of reciprocally BLASTing the two sets of contigs against each other to identify orthologous contigs and remove the redundant ones. However, I am not sure these are good strategies, and I would love to know other options.
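To make the "remove redundancy based on identity" idea concrete, here is a minimal sketch that handles only the simplest case: a contig that occurs exactly (on either strand) inside a longer contig from the merged set. The function names and the toy contigs are hypothetical, and real data would need an alignment-based identity threshold (for example from BLAST hits) rather than exact substring matching.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def remove_contained_contigs(contigs):
    """Drop contigs that occur exactly inside a longer (or equal) kept contig.

    contigs: dict mapping contig name -> sequence.
    Returns a dict of the retained contigs (sequences upper-cased).
    Assumption: only exact containment is detected, on either strand.
    """
    kept = {}
    # Visit the longest contigs first so shorter ones are tested against them.
    by_length = sorted(contigs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        seq_up = seq.upper()
        rc = reverse_complement(seq_up)
        redundant = any(seq_up in other or rc in other for other in kept.values())
        if not redundant:
            kept[name] = seq_up
    return kept


if __name__ == "__main__":
    merged = {
        "newbler_1": "ACGTACGTTTGACC",
        "velvet_1": "CGTTTG",    # contained in newbler_1 (forward strand)
        "velvet_2": "CAAACG",    # contained in newbler_1 (reverse strand)
        "velvet_3": "GGGGGGCC",  # not contained anywhere
    }
    print(sorted(remove_contained_contigs(merged)))
```

Note that this only discards fully contained contigs; it does not merge partially overlapping ones, which is the part that an assembler or a contig-merging tool would have to do.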

assembly illumina • 9.9k views

updated 13.0 years ago by Benm ▴ 710 • written 13.2 years ago by Lhl ▴ 760

Michael Kuhn · Answer 1 · 2011-06-06

Few assembly tools can handle hybrid assembly of NGS data. It puzzles me too, because the read types and error types differ between platforms. Here is a great review that introduces the differences: Michael L. Metzker, Sequencing technologies - the next generation. Nat Rev Genet. 2010 Jan;11(1):31-46. Epub 2009 Dec 8. Review.

Some software may do it well, although none of it was suitable for me:

MIRA - a whole genome shotgun and EST sequence assembler for Sanger, 454 and Solexa/Illumina reads.

ABySS - Assembly By Short Sequences, a de novo sequence assembler designed for very short reads.

ALLPATHS - a whole genome shotgun assembler that can generate high quality genome assemblies using short reads such as those produced by the new generation of sequencers.

ALLPATHS-LG - an updated version of ALLPATHS. It works on both small and large (mammalian-size) genomes.

Or you can assemble the 454 reads with Newbler and the Illumina reads with Velvet, as you did, and then combine the contigs with one of the following:

PHRAP - a program for assembling shotgun DNA sequence data, suitable for Sanger and 454 reads; it uses an overlap-layout-consensus algorithm.

CAP3/PCAP - CAP3 is for small-scale assembly of EST sequences with or without quality values; PCAP is for large-scale assembly of genomic sequences with quality values and with or without forward-reverse read pairs.

Euler - a new approach to fragment assembly that abandons the classical "overlap - layout - consensus" paradigm that is used in all currently available assembly tools.

If you want to construct scaffolds, you can try SSPACE (a tool for scaffolding pre-assembled contigs), PE-Assembler (a de novo assembler using short paired-end reads), etc.

On this question, I can also recommend an article: Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010 Jun;95(6):315-27. Epub 2010 Mar 6.

score 4 · Answer 2 · 2011-06-06

The problem is that the 454 and Illumina platforms yield different types of data, each with inherent problems. No hybrid assembler exists that takes into account the different error types that come with the respective data types. I suggest shredding your Velvet contigs into artificial 454 reads and feeding these to Newbler along with the original 454 data.

I have successfully used this approach before and have written a Biopieces tool for the shredding:

http://code.google.com/p/biopieces/wiki/shred_seq

It is of course also possible to shred the Newbler contigs and feed them to Velvet. I haven't tried this, since my gut feeling tells me that Newbler will do the best job. (Also, shred_seq currently doesn't produce paired-end reads, since no one has requested this.)
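For readers unfamiliar with the shredding idea, the following is a minimal sketch of it, not the shred_seq implementation: each contig is cut into overlapping fixed-length windows that can then be written out as pseudo-reads. The function name, the naming scheme for the pseudo-reads, and the default read length and step are all assumptions chosen for illustration; suitable values depend on your data.

```python
def shred_contig(name, seq, read_len=400, step=100):
    """Yield (read_name, read_seq) windows of read_len bases every step bases.

    Contigs no longer than read_len are emitted whole. A final window is
    anchored to the contig end so the last bases are always covered.
    """
    if len(seq) <= read_len:
        yield f"{name}_shred_0", seq
        return
    start = 0
    while start + read_len < len(seq):
        yield f"{name}_shred_{start}", seq[start:start + read_len]
        start += step
    last = len(seq) - read_len
    yield f"{name}_shred_{last}", seq[last:]


def to_fasta(records):
    """Format (name, sequence) pairs as a FASTA string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    # Tiny toy contig with a short window so the output is easy to read.
    print(to_fasta(shred_contig("velvet_1", "ACGTACGTAC", read_len=4, step=3)), end="")
```

A smaller step gives higher pseudo-coverage of each contig, which affects how much weight the shredded contigs carry relative to the real 454 reads.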

Alternatively, IDBA is supposed to be able to do hybrid assemblies, but in my hands it always segfaults.

Ray is also supposed to be able to do hybrid assemblies and, more interestingly, scaffolding. I haven't tested this, but I think you can feed Newbler contigs and Illumina reads to Ray.

score 4 · Answer 3 · 2011-06-06

FWIW, I've gotten the best results using Newbler on the 454 data and then using SSPACE to build scaffolds from the Illumina reads. I measure quality by aligning the Illumina reads to the result and counting the fraction of matched reads and matched pairs, and also by counting EST matches and a handful of fosmid ends.
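As a rough sketch of the read-based quality measure described above, assuming the Illumina reads have been aligned to the assembly with some read mapper and the alignments are available as SAM text, the fractions of mapped reads and properly paired reads can be counted from the FLAG column. The function name and the toy records are hypothetical; the bit values are the standard SAM flags (0x1 paired, 0x2 proper pair, 0x4 unmapped, 0x100 secondary, 0x800 supplementary).

```python
def mapping_stats(sam_lines):
    """Count mapped and properly paired reads from SAM text lines.

    Header lines (starting with '@') and secondary/supplementary
    alignments are skipped so each read is counted once.
    """
    total = mapped = paired = proper = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
        if flag & 0x1:
            paired += 1
            if flag & 0x2:
                proper += 1
    return {
        "reads": total,
        "mapped_fraction": mapped / total if total else 0.0,
        "proper_pair_fraction": proper / paired if paired else 0.0,
    }


if __name__ == "__main__":
    toy = [
        "@HD\tVN:1.6",
        "r1\t99\tctg1\t1\t60\t4M\t=\t50\t53\tACGT\tIIII",
        "r1\t147\tctg1\t50\t60\t4M\t=\t1\t-53\tACGT\tIIII",
        "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(mapping_stats(toy))
```

Comparing these two fractions across candidate assemblies of the same read set gives a measure that does not reward simply joining contigs.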

Runner-up method is CLC, which works decently on Illumina, but seems to be inferior to Newbler on 454 data. Celera also seems to be inferior to Newbler in my attempts at using it, and doesn't deal well with the amounts of Illumina data. I've not been able to get anything useful out of Velvet or SOAPdenovo, in spite of frequent praise and successful projects.

Quite likely, the optimal strategy depends on the types and amounts of data, and on the characteristics of the genome you're trying to assemble.

Bottom line is, try a variety of software, and make sure you measure the quality with whatever means you have - and that means something beyond N50.
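For reference, N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. The short sketch below (the function name is my own) shows how it is computed and why it should not be the only metric: joining contigs, correctly or not, raises N50 without adding a single base.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First length at which the cumulative sum reaches half the total.
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    fragmented = [10, 10, 10, 10]
    joined = [40]  # same bases, contigs simply concatenated
    print(n50(fragmented), n50(joined))
```

Both toy assemblies contain 40 bases, yet the second has four times the N50, which is why checks against reads, ESTs, or fosmid ends are needed alongside it.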

score 2 · Answer 4 · 2011-06-06

13.2 years ago

2184687-1231-83- ★ 5.1k

You could try assembling with 454, then adding the Illumina reads on top, then closing the gaps with something like IMAGE:
http://genomebiology.com/content/11/4/R41


Thanks avilella, this is in fact a good suggestion. Unfortunately, in my case I got my draft genome using Illumina, and the assembly of the 454 data produced far fewer contigs than the Illumina counterpart.

Reply • 13.2 years ago by Lhl ▴ 760

score 2 · Answer 5 · 2011-06-06

13.2 years ago

Marina Manrique ★ 1.3k

For a hybrid assembly I'd do a de novo assembly with MIRA (http://www.chevreux.org/projects_mira.html).

I think it is commonly used for hybrid assemblies, but I'm not completely sure.


MIRA currently struggles with hybrid assemblies with lots of Illumina data. One might try with a limited number of reads (and lots of memory!).