Question: How to handle multiple contig outputs from de novo assemblers when one contig is desired?
Asked 11 weeks ago by Trombone Engineer:

I am working with Oxford Nanopore MinION data for small genomes that I am trying to assemble with de novo assembly tools. For training, I have a few datasets with reference genomes and have been comparing various de novo assemblers. So far I have had the best performance from Unicycler, but I have not been able to find much information on polishing or otherwise handling multiple separate contigs when one long contig is desired. Sometimes the same assembler will output one contig, and other times it will output many separate contigs, even though there is enough overlap to hypothetically connect them.

I completed some genome polishing tutorials, such as the one for Nanopolish, but realized that polishing may not do what I want: combine separate contigs into one draft genome sequence. What are the designated tools for this task? Should I expect to do it manually with a visualization or mapping tool? Is pairwise alignment or multiple sequence alignment (MSA) helpful here?

Additionally, is there a reason why state-of-the-art assembly tools are unable to complete these assemblies automatically (into a single contig, that is)? I do not believe I have any unsequenced stretches, since my genomes are so small.


The vast majority of genome assemblies deposited to e.g. GenBank do not include complete chromosomes as continuous sequence. Is there some particular reason why contigs aren't good enough for you?

Reply written 11 weeks ago by 5heikki

If you have related genomes, reference-based scaffolding tools like this may be useful.

Reply written 10 weeks ago by colindaven
Answer written 11 weeks ago by Mensur Dlakic:

There are many reasons why most genomes come out incomplete after assembly. Some of them are: sequencing errors, sequence repeats, uneven coverage, inherent difficulty in cloning or amplifying certain genomic regions, sample contamination, poor technical handling, and bad luck. By the way, deep coverage on its own is not enough, especially for NGS methods that use short reads.

Assembly programs are designed to combine fragments in an intelligent way, which involves more than simply spotting an overlap between the fragments. The overlap needs to be long enough, free of mismatches (especially in the middle part), and supported by a good number of reads. Even if you have verified that one or more of these criteria are fulfilled, chances are that there isn't enough confidence to join the contigs reliably. You can inspect the assembly graphs if you wish to confirm or override the assembler's decisions.
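To make the first two criteria concrete, here is a minimal Python sketch that looks for the longest overlap between the end of one contig and the start of another. It is only an illustration, not what any real assembler does: the function name `best_end_overlap` and the parameters `min_len` and `max_mismatch_rate` are made up for this example, the toy sequences are invented, reverse-complement orientations are ignored, and read support (the third criterion) is not checked at all.

```python
def best_end_overlap(left, right, min_len=20, max_mismatch_rate=0.0):
    """Find the longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, mismatches), or (0, 0) if no overlap of at
    least `min_len` bases stays within the allowed mismatch rate.
    Hypothetical helper for illustration only.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink from there.
    for k in range(longest, min_len - 1, -1):
        suffix = left[-k:]
        prefix = right[:k]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * k:
            return k, mismatches
    return 0, 0


# Toy contigs (invented): the last 7 bases of contig_a equal the first 7 of contig_b.
contig_a = "ACGTACGTTTGACCA"
contig_b = "TTGACCAGGGTACC"

print(best_end_overlap(contig_a, contig_b, min_len=5))  # (7, 0)
print(best_end_overlap(contig_a, contig_b, min_len=8))  # (0, 0): overlap too short
```

The second call shows why an overlap you can see by eye may still be rejected: once the minimum length is raised above what the contig ends share, the join is no longer considered. Real assemblers apply stricter and more elaborate versions of this kind of threshold, together with read support.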
