Question: De novo genome assembly of haploid genome
Sarthok (60), Pennsylvania State University, wrote 3.0 years ago:

I am planning a de novo assembly of a bumblebee species, and the genome is haploid. Which assembler would be best for it? I am currently considering Platanus, MEGAHIT and Hapsembler.

Thanks for your help!

assembly genome • 1.2k views
modified 2.6 years ago by Biostar ♦♦ 20 • written 3.0 years ago by Sarthok (60)

What about your data? Is it Illumina, PacBio, Nanopore, etc.? What are the insert sizes if you have Illumina? This will inform the choice of assembler.

written 3.0 years ago by Philipp Bayer (6.0k)

The data is from an Illumina HiSeq; it is paired-end and the insert size is 150 bp.

written 3.0 years ago by Sarthok (60)
Rohit (1.3k), California, wrote 3.0 years ago:

The choice of assembly approach depends on the research question. Why would you attempt a de novo assembly from a single Illumina library when such a good genome is already available, as Philipp pointed out? If you are attempting a reference-based assembly, MIRA would be a good start.

The whole assembly pipeline depends on what kind of data you have; coverage and quality are the prime factors to consider. Since you have 150 bp Illumina data, you would have to go for de Bruijn graph assemblers that can handle multiple libraries. Because your insert size is small, the first step should be merging overlapping paired-end reads; the merged reads should then be supplied as an additional single-end library. In the end you would get many contigs broken at repetitive or high complexity regions.

Without multiple libraries or sequencing technologies, you would get highly fragmented assemblies. Is this what you need?
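
Since coverage is named above as a prime factor, a rough depth estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the read-pair count, read length and genome size below are hypothetical placeholders, not values reported in this thread.

```python
def expected_coverage(n_read_pairs: int, read_length: int, genome_size: int) -> float:
    """Mean sequencing depth = total sequenced bases / genome size.

    Each read pair contributes two reads of `read_length` bases.
    """
    total_bases = 2 * n_read_pairs * read_length
    return total_bases / genome_size


if __name__ == "__main__":
    # Hypothetical example values (assumptions, not from the thread):
    # 100 million read pairs, 100 bp reads, 400 Mb genome.
    depth = expected_coverage(
        n_read_pairs=100_000_000,
        read_length=100,
        genome_size=400_000_000,
    )
    print(f"Estimated mean coverage: {depth:.0f}x")
```

Plugging in the real read count (for example from a FASTQ line count divided by four) and an estimated genome size gives a quick sanity check before choosing an assembler.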


Hi Rohit,

We have done a reference-based assembly (aligning against the published genome of a closely related species, Bombus impatiens, from the paper Philipp mentioned), but we are now trying to identify possible novel variation in a specific 300 kb region of the genome, which is why we want to do a de novo assembly. I do not need the whole assembly to be very good; I only need the region we are interested in (around 300 kb) so that we can check for indels or novel variation that we might have missed in the reference-based assembly. I am not sure which assembler would be good for de novo assembly of a single paired-end library (insert size 150 bp). Please let me know if you have any suggestions.

modified 3.0 years ago • written 3.0 years ago by Sarthok (60)

Hi Sarthok

First you would have to check whether there is a breakpoint in the 300 kb region you are interested in. Mapping followed by breakpoint detection tools is usually good at this; OMICtools would be a good place to start:

http://omictools.com/whole-genome-resequencing-category

Since you have paired-end data with a small insert size, start with paired-end read merging; I usually use FLASH for this. Then try the MIRA assembler. Since it is a haploid assembly with a <400 Mb genome, it should be pretty straightforward, and MIRA also has a pretty impressive mailing list if you run into trouble. IDBA-UD and SOAPdenovo also do a good job of assembling, but misassemblies are something to watch out for with any de Bruijn graph assembler.
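
To make the idea of paired-end read merging concrete, here is a toy Python sketch. It is not FLASH's algorithm, only a simplified illustration under stated assumptions: reads are plain strings, base qualities are ignored, the longest acceptable overlap wins, and fragments shorter than a single read (where the reads run past each other) are not handled. The function names and thresholds are made up for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(r1: str, r2: str, min_overlap: int = 10,
               max_mismatch_frac: float = 0.1):
    """Merge a forward read with its mate if their ends overlap.

    The mate (r2) is reverse-complemented so both reads are on the same
    strand, then the suffix of r1 is compared with the prefix of the
    reoriented mate, trying the longest overlap first.
    Returns the merged sequence, or None if no acceptable overlap exists.
    """
    r2_fwd = reverse_complement(r2)
    for olen in range(min(len(r1), len(r2_fwd)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(r1[-olen:], r2_fwd[:olen]))
        if mismatches <= max_mismatch_frac * olen:
            # Keep r1 in full and append the part of the mate beyond the overlap.
            return r1 + r2_fwd[olen:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCATGCCGATAGCTAGGCTA"            # hypothetical 25 bp insert
    read1 = fragment[:18]                             # forward read
    read2 = reverse_complement(fragment[-18:])        # reverse read
    print(merge_pair(read1, read2) == fragment)
```

For real data a dedicated merger such as FLASH is the better choice, since it scores overlaps and uses base qualities; the sketch only shows why a short insert lets two reads collapse into one longer single-end read.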

written 3.0 years ago by Rohit (1.3k)
Philipp Bayer (6.0k), Australia/Perth/UWA, wrote 3.0 years ago:

If your Illumina HiSeq read length is 250 bp I'd recommend DISCOVAR: https://www.broadinstitute.org/software/discovar/blog/?page_id=23

Since you don't seem to have mate-pair or similar libraries, I wouldn't expect the best results with other assemblers.

Have you seen this recent bumblebee genome paper? They used Newbler and SOAPdenovo: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0623-3


Hi Philipp,

We have only paired-end library data with a 150 bp insert size. The bumblebee genome authors had 454 and mate-pair libraries, so I guess our choice of assembler would differ from theirs.

modified 3.0 years ago • written 3.0 years ago by Sarthok (60)
Sarthok (60), Pennsylvania State University, wrote 2.9 years ago:

I went with MIRA and experimented with several parameters. I got an assembly with an N50 of 50 kb, and BLAST results identified the contigs for the region I am interested in. Thanks all for your help!
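
For readers who want to check an N50 figure like the one above, here is a minimal Python sketch of the usual definition: the length of the contig at which the running total of contigs, sorted longest first, reaches half of the total assembly length. The helper names are hypothetical, and the FASTA parsing is deliberately simple.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a non-empty collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the assembly
            return length


if __name__ == "__main__":
    example = [">c1", "ACGTACGT", ">c2", "ACGT", "AC", ">c3", "GG"]
    lens = contig_lengths(example)
    print(lens, n50(lens))
```

With a real assembly, passing an open file handle (for example `open("contigs.fasta")`, a placeholder file name) to `contig_lengths` works the same way as the list of lines used here.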


Can you tell me which parameters you used?

written 2.5 years ago by Picasa (390)

I went through the MIRA manual (which is extremely helpful) to write and tune my parameters. The right settings depend entirely on your data (read type, template size, etc.) and on the kind of assembly you want to generate (draft/accurate, de novo/reference-based). Here is one of the parameter sets I used, but yours could be very different depending on your data and assembly type.

parameters = COMMON_SETTINGS \
             -GENERAL:number_of_threads=20 \
             -NW:cnfs=warn \
             -NW:cmrnl=warn \
             SOLEXA_SETTINGS \
             -CL:pec
job = genome,denovo,accurate
readgroup = DataIlluminaPairedLib
data = /storage/foo_R1.fastq /storage/foo_R2.fastq
technology = solexa
template_size = 350 700 autorefine
segment_placement = ---> <---

written 2.5 years ago by Sarthok (60)