Question: De novo genome assembly strategy
0
3.4 years ago by
joneill4x40
Canada
joneill4x40 wrote:

I am assembling a genome de novo. I have:

10X coverage with PacBio reads

100X coverage with Illumina short reads (150 bp paired-end reads) 

20X coverage with long MiSeq reads (max length 800 bp)

Given what I have to work with, what would be the best strategy to assemble the genome and why?

Thank you,

Joe

edit - genome size ~ 1Gb
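For a sense of scale, the coverage figures above can be turned into rough data volumes. This is only a back-of-the-envelope sketch: it assumes the genome is exactly 1 Gb and that every Illumina read is exactly 150 bp, and the function names are made up for illustration.

```python
# Rough data volumes implied by the coverage figures in the question.
# Assumption: genome size is exactly 1 Gb (the post says ~1 Gb).
GENOME_SIZE_BP = 1_000_000_000


def total_bases(coverage, genome_size_bp=GENOME_SIZE_BP):
    """Total sequenced bases implied by a given fold coverage."""
    return coverage * genome_size_bp


def approx_read_count(coverage, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Approximate read count, assuming all reads share one fixed length."""
    return total_bases(coverage, genome_size_bp) // read_length_bp


pacbio_bases = total_bases(10)             # 10X PacBio
illumina_reads = approx_read_count(100, 150)  # 100X Illumina, 150 bp reads
miseq_bases = total_bases(20)              # 20X long MiSeq reads

print(f"PacBio bases:   {pacbio_bases:,}")
print(f"Illumina reads: {illumina_reads:,}")
print(f"MiSeq bases:    {miseq_bases:,}")
```

Numbers of this size are the reason genome size matters so much when picking an assembler.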

sequencing assembly genome • 2.6k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by joneill4x40
2

You should specify the genome type. Some tools will not be able to work on big genomes.

 

 

ADD REPLYlink written 3.4 years ago by Juke-342.1k

We have similar data sets, and I was wondering what you decided to use in the end. I would also appreciate hearing about your experience. Thanks.

ADD REPLYlink written 2.9 years ago by s-writes0

I ended up using DBG2OLC

What led me there: https://github.com/PacificBioscience...Bio-Long-Reads

The publication: http://arxiv.org/ftp/arxiv/papers/1410/1410.2801.pdf

The code: http://sourceforge.net/projects/dbg2olc/

I'm quite pleased with the results of DBG2OLC.

I corresponded with the authors, managed to closely replicate the results from their paper, and made some pretty decent draft assemblies of my own with minimal data. Fast performance and good results.

ADD REPLYlink written 2.9 years ago by joneill4x40
3
3.4 years ago by
Adrian Pelin2.3k
Canada
Adrian Pelin2.3k wrote:

SPAdes should provide very nice results for your dataset. It will assemble your 100x Illumina data using a multi-k-mer approach, then resolve some repeats using your long MiSeq reads, and additionally scaffold using the PacBio reads.

http://bioinf.spbau.ru/spades

So you can use their suggested guidelines for 150bp reads:

spades.py -k 21,33,55,77 --careful <your reads> -o spades_output

You can specify pacbio as: --pacbio

Your 100x paired-end library as: --pe1-1 (forward reads) and --pe1-2 (reverse reads)

and your single end MiSeq as --s2
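Putting the options above together, the full call might look roughly like the sketch below. The file names and the helper function are hypothetical placeholders; the flags are the ones mentioned in this answer, with the forward and reverse files of the same paired-end library given as --pe1-1 and --pe1-2. The script only builds and prints the command, it does not run SPAdes.

```python
import shlex


def build_spades_command(pe_forward, pe_reverse, miseq_single, pacbio, outdir,
                         kmers=(21, 33, 55, 77)):
    """Return a spades.py argument list for one paired-end library,
    one single-read library and one PacBio library."""
    return [
        "spades.py",
        "-k", ",".join(str(k) for k in kmers),  # k-mer sizes suggested for 150 bp reads
        "--careful",
        "--pe1-1", pe_forward,    # paired-end library 1, forward reads
        "--pe1-2", pe_reverse,    # paired-end library 1, reverse reads
        "--s2", miseq_single,     # library 2, single-end MiSeq reads
        "--pacbio", pacbio,       # PacBio long reads
        "-o", outdir,
    ]


# Hypothetical file names, for illustration only.
cmd = build_spades_command(
    "illumina_R1.fastq", "illumina_R2.fastq",
    "miseq_long.fastq", "pacbio.fastq", "spades_output",
)
print(shlex.join(cmd))
```

Keeping the arguments in a list makes it straightforward to pass them to subprocess.run once the paths are real.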

ADD COMMENTlink written 3.4 years ago by Adrian Pelin2.3k

A nice tool. But it will work only for smaller genomes.

ADD REPLYlink written 3.4 years ago by Juke-342.1k

I have used it on genomes up to 150 Mb. Then again, the OP did not mention what the genome size is.

ADD REPLYlink written 3.4 years ago by Adrian Pelin2.3k

Thanks Adrian. Using SPAdes was my first thought too. However, my genome is large, ~1 Gb, so I don't think I can use it.

ADD REPLYlink written 3.4 years ago by joneill4x40

I found SPAdes and dipSPAdes to run extremely slowly when using PacBio reads as input.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by joneill4x40
1
3.4 years ago by
Juke-342.1k
Sweden
Juke-342.1k wrote:

ALLPATHS-LG could be a solution: it performs the assembly from the Illumina short reads and then scaffolds using the PacBio data.

For Illumina reads it needs high coverage (100x), so your data is fine in that respect, but on the other hand it needs very specific libraries (a 3 kbp mate-pair library?). You should check.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Juke-342.1k

Thanks Juke.

ADD REPLYlink written 3.4 years ago by joneill4x40

IIRC ALLPATHS-LG requires overlapping PE and one short mate-pair library.  So it may not work if the above libraries don't fit this specification. 

ADD REPLYlink written 3.4 years ago by Chris Fields2.1k
1
3.4 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

ALLPATHS‐LG requires a minimum of 2 paired‐end libraries – one short and one long. The short library average separation size must be slightly less than twice the read size, such that the reads from a pair will likely overlap – for example, for 100 base reads the insert size should be 180 bases. The distribution of sizes should be as small as possible, with a standard deviation of less than 20%. The long library insert size should be approximately 3000 bases long and can have a larger size distribution. Additional optional longer insert libraries can be used to help disambiguate larger repeat structures and may be generated at lower coverage

EDIT: Copied from the manual
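As a quick sanity check, the fragment-library rules quoted above can be written down as a small test. This is only a sketch of my reading of that paragraph: it takes "less than 20%" to mean a standard deviation below 20% of the mean insert size, and the function name and the second example library are made up.

```python
def check_fragment_library(read_length, insert_mean, insert_sd):
    """Check a short (fragment) library against the quoted rules.

    Returns a list of problems; an empty list means the library fits.
    Assumption: the 20% limit is relative to the mean insert size.
    """
    problems = []
    # Mates can only overlap if the insert is shorter than two read lengths.
    if insert_mean >= 2 * read_length:
        problems.append("mates are unlikely to overlap")
    # The insert size distribution should be tight.
    if insert_sd >= 0.20 * insert_mean:
        problems.append("insert size distribution is too wide")
    return problems


# The manual's own example: 100 base reads with a 180 base insert.
manual_example = check_fragment_library(100, 180, 18)

# A hypothetical standard 150 bp paired-end library with a 500 bp insert.
typical_pe = check_fragment_library(150, 500, 50)

print(manual_example)
print(typical_pe)
```

A 150 bp paired-end library would need an insert somewhat under 300 bp to pass the overlap rule, so it is worth checking the actual insert size before committing to ALLPATHS-LG.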

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Antonio R. Franco4.0k
1
3.4 years ago by
Juke-342.1k
Sweden
Juke-342.1k wrote:

You can also use MaSuRCA mega-reads.

MaSuRCA generally gives relatively good results.

It is one of the few true hybrid assemblers (de Bruijn/OLC).

ADD COMMENTlink written 3.4 years ago by Juke-342.1k
1

Thanks Juke.  However, I don't think I should use it for my task because "We note that the modified version of CABOG 6.1 used in MaSuRCA is not capable of supporting the long high-error-rate reads generated by the PacBio technology."

ADD REPLYlink written 3.4 years ago by joneill4x40