Question

De novo genome assembly of a haploid genome

1

8.4 years ago

Sarthok ▴ 70

I am planning a de novo assembly of a bumblebee species, and the sample is haploid. Which assembler would be best for it? At the moment I am considering Platanus, MEGAHIT and Hapsembler.

Thanks for your help!

genome Assembly • 2.7k views

ADD COMMENT • link updated 8.0 years ago by Biostar 20 • written 8.4 years ago by Sarthok ▴ 70

0

What about your data? Is it Illumina, PacBio, Nanopore, etc.? If it is Illumina, what are the insert sizes? This will inform the choice of assembler.

ADD REPLY • link 8.4 years ago by Philipp Bayer 8.7k

0

The data is from an Illumina HiSeq; it's paired-end and the insert size is 150 bp.

ADD REPLY • link 8.4 years ago by Sarthok ▴ 70

score 4 · Answer 1 · 2016-04-26

4

8.4 years ago

Rohit ★ 1.5k

The assembly strategy depends on the research question. Why would you try a de novo assembly with just a single Illumina library when you already have such a good genome, as suggested by Philipp? If you are trying a reference-based assembly, MIRA would be a good start.

The whole assembly pipeline depends on what kind of data you have; coverage and quality are the prime factors to consider. Since you have 150 bp Illumina data, you would have to go for de Bruijn assemblers that can handle multiple libraries. Merging overlapping paired-end reads would be the first thing to do since your insert size is small; the merged reads should be supplied as an additional single-end library. All you would get in the end would be many contigs broken at repetitive or high-complexity regions.
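To make the paired-end overlap step concrete, here is a minimal Python sketch of the idea, not what any particular merging tool does internally. It assumes exact matches in the overlap and a fragment at least as long as each read; real mergers also tolerate mismatches and use base qualities. The function names and the `min_overlap` default are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read with its mate if they overlap exactly.

    read2 is given as sequenced (opposite strand), so it is
    reverse-complemented first. Returns the merged fragment, or None if
    no exact suffix/prefix overlap of at least min_overlap bases exists.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    # Try the longest candidate overlap first and shrink from there.
    for overlap in range(longest, min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


# Hypothetical 30 bp fragment sequenced with 20 bp reads from both ends.
fragment = "ACGTTGCATGCCGATAGCTAGGCTTAACGG"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
print(merge_pair(read1, read2))
```

The merged sequences would then go to the assembler as a single-end library, with the pairs that could not be merged kept as paired-end reads.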

Without multiple libraries or sequencing technologies, you would get highly fragmented assemblies. Is this what you need?
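A toy illustration of why repeats fragment a de Bruijn assembly, as a sketch under simplifying assumptions: the graph is built directly from a made-up sequence rather than from reads, and reverse complements and sequencing errors are ignored. A node with more than one successor is a point where the assembler cannot tell which path to follow without longer-range information, so the contig ends there.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor, where a contig must stop."""
    return sorted(node for node, nexts in graph.items() if len(nexts) > 1)


# Made-up sequence in which GATTACA occurs twice with different flanks.
genome = "TCCGATTACAGTGGATTACATAC"
graph = build_de_bruijn([genome], k=4)
print(branching_nodes(graph))  # ['ACA']: the end of the repeat is ambiguous
```

Mate-pair or long-read data resolves such branches by linking the unique sequence on either side of the repeat.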

ADD COMMENT • link 8.4 years ago by Rohit ★ 1.5k

0

Hi Rohit,

We have done a reference-based assembly (aligning to the published genome of a closely related species, Bombus impatiens, from the paper Philipp mentioned), but we are now trying to identify some (possibly) novel variation in a specific part of the genome (a 300 kb region), and that's why we want to do a de novo assembly. I don't need the whole assembly to be very good; I only need the piece we are interested in (around 300 kb) in order to check for indels or novel variation that we might have missed in our reference-based assembly. But I'm not sure what would be a good assembler for de novo assembly of a paired-end library (insert size 150 bp). Please let me know if you have any suggestions.

ADD REPLY • link 8.3 years ago by Sarthok ▴ 70

0

Hi Sarthok

First you would have to check whether there is a breakpoint in the 300 kb region you are interested in. Mapping followed by breakpoint detection tools is usually good at this; OMICtools would be a good place to start:

http://omictools.com/whole-genome-resequencing-category

Since you have paired-end data with a small insert size, start with paired-end read merging; I usually use FLASH for this. Then try the MIRA assembler. Since it is a haploid assembly of a <400 Mb genome, it should be pretty straightforward, and MIRA has a pretty impressive mailing list if you run into trouble. IDBA-UD and SOAPdenovo also do a good job of assembling, but misassemblies are something to watch out for with any de Bruijn assembler.

ADD REPLY • link 8.3 years ago by Rohit ★ 1.5k

score 2 · Answer 2 · 2016-04-26

2

8.4 years ago

Philipp Bayer 8.7k

If your Illumina HiSeq read length is 250 bp I'd recommend DISCOVAR: https://www.broadinstitute.org/software/discovar/blog/?page_id=23

Since you don't seem to have mate-pair or similar libraries, I wouldn't expect the best results with other assemblers.

Have you seen this recent bumblebee genome paper? They used Newbler and SOAPdenovo: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0623-3

ADD COMMENT • link 8.4 years ago by Philipp Bayer 8.7k

1

Hi Philipp,

We only have paired-end library data with a 150 bp insert size. The bumblebee genome authors had 454 and mate-pair libraries, so I guess our choice of assembler would differ from theirs.

ADD REPLY • link 8.3 years ago by Sarthok ▴ 70

score 1 · Answer 3 · 2016-06-01

1

8.3 years ago

Sarthok ▴ 70

I went with MIRA and experimented with several parameters. I got an assembly with an N50 of 50 kb, and BLAST results identified the contigs for the region I am interested in. Thanks all for your help!
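For anyone wanting to check the N50 of their own assembly, here is a small Python sketch of the usual definition: the contig length L such that contigs of length L or longer hold at least half of the assembled bases. The helper names are illustrative, and the FASTA reader is deliberately minimal (plain text, one or more records).

```python
def contig_lengths(fasta_lines):
    """Yield the sequence length of each record in an iterable of FASTA lines."""
    length = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs given")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

With a real assembly, something like `n50(contig_lengths(open("contigs.fasta")))` would do, where `contigs.fasta` is a placeholder file name.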

ADD COMMENT • link 8.3 years ago by Sarthok ▴ 70

0

Can you tell me which parameters you used?

ADD REPLY • link 7.9 years ago by Picasa ▴ 650

0

I went through the MIRA manual (which is extremely helpful) to write and tune my parameters. The right settings depend entirely on your data (read type, template size, etc.) and on what kind of assembly you expect to generate (draft/accurate, de novo/reference-based). Here is one of the parameter sets I used, but yours could be very different depending on your data and assembly type.

    parameters = COMMON_SETTINGS \
                 -GENERAL:number_of_threads=20 \
                 -NW:cnfs=warn \
                 -NW:cmrnl=warn \
                 SOLEXA_SETTINGS \
                 -CL:pec
    job = genome,denovo,accurate
    readgroup = DataIlluminaPairedLib
    data = /storage/foo_R1.fastq /storage/foo_R2.fastq
    technology = solexa
    template_size = 350 700 autorefine
    segment_placement = ---> <---

ADD REPLY • link 7.8 years ago by Sarthok ▴ 70