Question: How can I increase the length of contigs in de novo assembly?
ebrahimiet wrote:

Hi all,

I am performing de novo assembly of NDV (a 15 kb negative-sense RNA genome) from Illumina paired-end 200 bp reads. How can I increase the length of the assembled contigs?

Thanks

Tags: assembly • 1.7k views
modified 3.2 years ago • written 3.2 years ago by ebrahimiet (40)

A tiny genome like that should assemble easily from PE 200 bp reads. You may actually have too much data, in which case you would need to sub-sample. How much data do you have (and is it all from this virus)?
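To make the sub-sampling point concrete, here is a minimal sketch of the arithmetic, not a recommendation from anyone in this thread. The 100x target depth and the function names are assumptions chosen for illustration; the 15 kb genome and 200 bp paired-end reads come from the question.

```python
import math


def pairs_for_coverage(genome_len, read_len, target_cov):
    """Read pairs needed to reach target_cov on a genome of genome_len bases.

    Each pair contributes 2 * read_len bases.
    """
    bases_needed = genome_len * target_cov
    return math.ceil(bases_needed / (2 * read_len))


def sampling_fraction(total_pairs, genome_len, read_len, target_cov):
    """Fraction of the available pairs to keep (capped at 1.0)."""
    needed = pairs_for_coverage(genome_len, read_len, target_cov)
    return min(1.0, needed / total_pairs)


if __name__ == "__main__":
    # Assumed target depth of 100x; adjust to taste.
    print(pairs_for_coverage(15_000, 200, 100))
    # Hypothetical run with one million read pairs.
    print(sampling_fraction(1_000_000, 15_000, 200, 100))
```

With these assumed numbers only a few thousand pairs are needed, so a run with millions of pairs would be sampled down to well under 1% of the data.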

You may want to give tadpole.sh from BBMap a try.

modified 3.2 years ago • written 3.2 years ago by genomax (73k)

I tried Tadpole a few months ago on GAGE-B. It generated a highly fragmented assembly, seemingly because it does not do any graph pruning. I am not sure it is a good choice for the OP.

written 3.2 years ago by lh3 (31k)

Tadpole produces much more fragmented assemblies than, say, SPAdes on most datasets, such as bacteria and more complex organisms. So I would never expect the current version to outperform SPAdes on a bacterial benchmark in terms of contiguity. But for whatever reason, it has produced much better assemblies for some viruses, in situations where SPAdes produces a very poor assembly.

written 3.2 years ago by Brian Bushnell (16k)

Have you tried tuning SPAdes or feeding Tadpole-corrected reads to it? The lack of graph pruning still bugs me. If there is heterogeneity between strains, how will Tadpole deal with that? It seems to me that the right combination would be an aggressive error corrector that is robust to ultra-high depth, plus an assembler capable of sophisticated graph cleaning.

written 3.2 years ago by lh3 (31k)

I think the problems were a result of SPAdes making multiple duplicate copies of polymorphic regions of the viruses where it thought there were repeats. The assemblies ended up many times larger than expected. This persisted despite attempts at both error-correction (using Tadpole) and normalization, so I assume it is due to the interplay of graph-processing heuristics and a high viral polymorphism rate, rather than errors. I did not try tuning SPAdes' parameters, though, as I have not found in the past that I was able to achieve better assemblies by doing so. I agree that sophisticated graph operations should result in a better assembly, as they do for bacteria, but such operations are always based on assumptions, and it appears the assumptions did not fit these viruses very well.
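One quick way to spot the "assembly many times larger than expected" symptom described above is to compare the total contig length with the expected genome size. The following is a minimal sketch under stated assumptions: the function name and the example contig lengths are hypothetical, and it takes plain contig lengths rather than parsing any particular assembler's output.

```python
def assembly_stats(contig_lengths, expected_genome_len):
    """Return total length, N50, and total/expected ratio for an assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50: length of the contig at which the running sum
        # first reaches half of the total assembly length.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "total": total,
        "n50": n50,
        "inflation": total / expected_genome_len,
    }


if __name__ == "__main__":
    # Hypothetical contig lengths for a genome expected to be ~15 kb.
    print(assembly_stats([8000, 7000, 6000, 5000, 4000], 15_000))
```

An inflation ratio well above 1 suggests duplicated or variant copies of the same regions rather than a genuinely larger genome.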

written 3.2 years ago by Brian Bushnell (16k)

That is why I put it in the comments: I was not sure if it would work.

@ebrahimiet has another post that I guess is related to this. One problem in this case could be an excess of data, considering the small genome and 200 bp PE Illumina reads.

written 3.2 years ago by genomax (73k)

I am using CLC Genomics Workbench.

written 3.2 years ago by ebrahimiet (40)
Sej Modha (Glasgow, UK) wrote:

We use SPAdes for virus genome assemblies with recommended k-mer values and it produces really good assemblies.

written 3.2 years ago by Sej Modha (4.5k)
lh3 (United States) wrote:

Contigs can be short for one of two reasons: 1) a contig connects to too few contigs; 2) a contig connects to too many contigs. The first thing to do is check which is the case. For your data, 2) is more likely: the assembler is picking up strain differences. In that case you need an assembler that can aggressively prune error/variant-containing subgraphs. What assembler are you using? SPAdes is always a good start for small genomes. Velvet and my fermi-lite might be worth trying.
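The two cases can be illustrated on a toy de Bruijn graph. This is a simplified sketch, not how SPAdes, Velvet, or fermi-lite work internally: it ignores reverse complements, k-mer counts, and error handling, and the function names and toy reads are made up for illustration. Nodes with a missing predecessor or successor correspond to case 1 (too few connections), and nodes with more than one predecessor or successor correspond to case 2 (too many connections, e.g. at a strain variant).

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            succ[prefix].add(suffix)
            pred[suffix].add(prefix)
    return succ, pred


def classify_nodes(succ, pred):
    """Return (dead_ends, branches) as sets of (k-1)-mers."""
    nodes = set(succ) | set(pred)
    # Case 1: a contig would stop here because nothing connects on one side.
    dead_ends = {n for n in nodes
                 if not succ.get(n) or not pred.get(n)}
    # Case 2: a contig would stop here because the path is ambiguous.
    branches = {n for n in nodes
                if len(succ.get(n, ())) > 1 or len(pred.get(n, ())) > 1}
    return dead_ends, branches


if __name__ == "__main__":
    # Two toy "strains" that differ at a single base.
    reads = ["AACCGGTT", "AACCTGTT"]
    succ, pred = build_graph(reads, k=4)
    dead_ends, branches = classify_nodes(succ, pred)
    print(sorted(dead_ends), sorted(branches))
```

In the toy example the single-base difference opens a bubble: the node before the variant gains two successors and the node after it gains two predecessors, which is the kind of subgraph that pruning or bubble popping is meant to collapse.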

modified 3.2 years ago • written 3.2 years ago by lh3 (31k)