Question: Assessing The Quality Of De Novo Assembled Data
18
Prakki Rama2.3k wrote (7.4 years ago, Singapore):

Are there any other ways of assessing the quality of assembled data obtained from different assemblers apart from metrics like N50 and assembly size?

I know a few, such as:

  • BLASTing the contigs and checking the % of BLAST hits.
  • checking the % of reads that map back to the contigs of each assembly (a minimal sketch of this check is below).
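For that second check, something like the following pysam sketch would do, assuming the reads have already been aligned back to the contigs (e.g. with BWA or bowtie2) and saved as reads_vs_assembly.bam, which is an illustrative filename:

    import pysam

    def mapped_fraction(bam_path):
        """Count primary alignments and return the fraction that are mapped."""
        mapped = total = 0
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch(until_eof=True):
                # skip secondary/supplementary records so each read is counted once
                if read.is_secondary or read.is_supplementary:
                    continue
                total += 1
                if not read.is_unmapped:
                    mapped += 1
        return mapped, total, (mapped / total if total else 0.0)

    m, t, frac = mapped_fraction("reads_vs_assembly.bam")
    print(f"{m}/{t} reads mapped back ({frac:.1%})")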

Any other ideas are appreciated.

assembly next-gen • 16k views
19
Nikolay Vyahhi1.2k wrote (6.9 years ago, St. Petersburg, Russia):

QUAST (QUality ASsessment Tool for Genome Assemblies) can be used to assess the quality of genome assemblies, both de novo and reference-based.
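For example, a minimal run driven from Python via subprocess; the FASTA filenames are placeholders, and since options vary between releases, check quast.py --help for your installed version:

    import subprocess

    assemblies = ["spades_contigs.fasta", "velvet_contigs.fasta"]  # illustrative names
    cmd = ["quast.py", *assemblies,
           "-r", "reference.fasta",   # optional: drop this for a purely de novo evaluation
           "-o", "quast_report"]      # output directory
    subprocess.run(cmd, check=True)
    # N50, misassemblies, genome fraction, etc. are summarised in quast_report/report.txt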


The QUAST paper was published in Bioinformatics: http://bioinformatics.oxfordjournals.org/content/early/2013/02/18/bioinformatics.btt086.abstract

written 6.8 years ago by Nikolay Vyahhi1.2k

I also highly recommend QUAST for these tasks.

written 6.0 years ago by Hayssam270
13
Markf290 wrote (7.4 years ago, New Zealand):

Not to beat my own drum (too much), but I've written a tool that can be useful for this. The idea is that you map all raw reads back to the assembled genome and then assess which read pairs map and, more importantly, which map but at an unexpected distance. The tool takes a BAM input file, processes it, and then lets you generate plots.
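This is not hagfish itself, just a minimal pysam sketch of the same idea: collect the insert sizes of pairs mapped back to the assembly and flag pairs at an unexpected distance. The BAM filename and the 3-standard-deviation cutoff are illustrative assumptions.

    import statistics
    import pysam

    def insert_sizes(bam_path):
        """Insert sizes of read pairs where both mates map to the same contig."""
        sizes = []
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch(until_eof=True):
                if (read.is_paired and read.is_read1
                        and not read.is_unmapped and not read.mate_is_unmapped
                        and read.reference_name == read.next_reference_name
                        and not read.is_secondary and not read.is_supplementary):
                    sizes.append(abs(read.template_length))
        return sizes

    sizes = insert_sizes("reads_vs_assembly.bam")
    mean, sd = statistics.mean(sizes), statistics.stdev(sizes)
    odd = sum(1 for s in sizes if abs(s - mean) > 3 * sd)
    print(f"mean insert {mean:.0f} +/- {sd:.0f}; {odd} pairs at an unexpected distance")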

See: https://github.com/mfiers/hagfish

cheers Mark


I was just looking for a coverage-plotting library! Great coincidence, thanks!!!

modified 7.4 years ago by Neilfws48k • written 7.4 years ago by Philipp Bayer6.5k

If you need help or advice, let me know. If you need fixes or features, you can add them to the issue list on GitHub.

written 7.4 years ago by Markf290

Interesting. But what if the paired-end reads are already mated?

written 6.9 years ago by Manu Prestat3.9k

What do you mean by "already mated"?

If you align paired-end reads to your assembly, the insert size shouldn't be too large or too small. If it is too large, that's an indication that your assembly includes sequence that does not exist; if it is too small, that's an indication that your assembly is missing a region. And if a region of the assembly is not bridged by read pairs at all, that's an indication that the region doesn't exist in reality.
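A rough sketch of that last check (not a published tool, just an illustration): mark every assembly position spanned by a properly paired read pair and report the fraction that is never bridged. The BAM filename is illustrative; pysam and numpy are assumed to be available.

    import numpy as np
    import pysam

    def unbridged_fraction(bam_path):
        """Fraction of assembly positions not spanned by any proper read pair."""
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            spanned = {name: np.zeros(length, dtype=bool)
                       for name, length in zip(bam.references, bam.lengths)}
            for read in bam.fetch(until_eof=True):
                # template_length > 0 selects the leftmost mate of each pair exactly once
                if (read.is_proper_pair and read.template_length > 0
                        and not read.is_secondary and not read.is_supplementary):
                    start = read.reference_start
                    spanned[read.reference_name][start:start + read.template_length] = True
        total = sum(len(v) for v in spanned.values())
        covered = sum(int(v.sum()) for v in spanned.values())
        return 1 - covered / total

    print(f"{unbridged_fraction('reads_vs_assembly.bam'):.2%} of the assembly is not bridged")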

written 6.9 years ago by Philipp Bayer6.5k

I meant (pre-)assembly of the two reads in each pair (if the read/insert size combination allows), as tools like FLASH (http://genomics.jhu.edu/software/FLASH/index.shtml) do. In that case, relying on a tool that studies the mapping of the pairs would be useless.

written 6.9 years ago by Manu Prestat3.9k
5
Lee Katz3.0k wrote (6.8 years ago, Atlanta, GA):

Old topic, but this was just published. I'm curious how well it performs and will hopefully be testing it myself this week or soon (whenever time permits).

http://www.ncbi.nlm.nih.gov/pubmed/23303509

http://sc932.github.com/ALE/about.html

Abstract
MOTIVATION:
Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.
RESULTS:
In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
5
Prakki Rama2.3k wrote (4.2 years ago, Singapore):

One more for the list: assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCO for short). It replaces the discontinued CEGMA.
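A minimal invocation looks something like the following (run here via Python's subprocess). The -i/-o/-m/-l options match recent BUSCO releases but are version-dependent, and the lineage dataset name is only an example, so check busco --help:

    import subprocess

    subprocess.run(
        ["busco",
         "-i", "assembly.fasta",    # contigs/scaffolds to assess
         "-o", "busco_assembly",    # name of the output directory
         "-m", "genome",            # genome mode (vs. transcriptome/proteins)
         "-l", "bacteria_odb10"],   # lineage dataset matching your organism
        check=True,
    )
    # short_summary*.txt in the output reports % complete, fragmented and missing BUSCOs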

4
Daniel Standage3.9k wrote (6.9 years ago, Davis, California, USA):

I don't have experience with any tools that estimate quality by re-mapping reads to the de novo assembled sequence, so I'll have to check some of these out. I typically use the following metrics to compare the relative quality of my genome assemblies.

  • N50 and N90
  • number of contigs or scaffolds
  • length of the longest contig or scaffold
  • combined length of all contigs or scaffolds
  • % CEGs (conserved core eukaryotic genes) mapped

For this last one, I use the CEGMA method[1] to identify genes that are highly conserved among all eukaryotes (implementation available at http://korflab.ucdavis.edu/datasets/cegma). The more of these conserved genes CEGMA is able to identify, the more confidence I have in the quality of the assembly and my ability to accurately annotate other genes in that genome.
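The length-based metrics in the list above are easy to compute directly from the assembly FASTA. A dependency-free sketch (the filename is a placeholder):

    def contig_lengths(fasta_path):
        lengths, current = [], 0
        with open(fasta_path) as fh:
            for line in fh:
                if line.startswith(">"):
                    if current:
                        lengths.append(current)
                    current = 0
                else:
                    current += len(line.strip())
        if current:
            lengths.append(current)
        return sorted(lengths, reverse=True)

    def nx(lengths, fraction):
        """N50 for fraction=0.5, N90 for fraction=0.9 (lengths sorted longest first)."""
        target = fraction * sum(lengths)
        running = 0
        for length in lengths:
            running += length
            if running >= target:
                return length

    lengths = contig_lengths("assembly.fasta")
    print("sequences:", len(lengths), " longest:", lengths[0], " total:", sum(lengths),
          " N50:", nx(lengths, 0.5), " N90:", nx(lengths, 0.9))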


  1. Parra G, Bradnam K, Korf I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23: 1061-1067. doi:10.1093/bioinformatics/btm071.

Can you please elaborate on your answer? I would like to know how you draw inferences from the comparison, i.e. how you judge whether an alignment is good or poor. Even a small effort would be appreciated, as I am trying to avoid tools.

written 18 months ago by k.rajain12120
3
Manu Prestat3.9k wrote (6.9 years ago, Marseille, France):

In addition to the others mentioned, I think an ORF prediction step can give you a strong and fast comparative insight across several assemblies.
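A quick-and-dirty illustration of the idea (for real comparisons a proper gene finder such as Prodigal is preferable): count long ORFs per assembly and compare the totals. The 300 nt cutoff and the filename are arbitrary assumptions.

    def read_fasta(path):
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq)
                    name, seq = line[1:].split()[0], []
                else:
                    seq.append(line.strip().upper())
        if name is not None:
            yield name, "".join(seq)

    def count_orfs(seq, min_len=300):
        """Count ATG...stop ORFs of at least min_len nt on both strands."""
        total = 0
        rc = seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]
        for strand in (seq, rc):
            for frame in range(3):
                start = None
                for i in range(frame, len(strand) - 2, 3):
                    codon = strand[i:i + 3]
                    if start is None and codon == "ATG":
                        start = i
                    elif start is not None and codon in ("TAA", "TAG", "TGA"):
                        if i + 3 - start >= min_len:
                            total += 1
                        start = None
        return total

    print(sum(count_orfs(seq) for _, seq in read_fasta("assembly.fasta")), "ORFs >= 300 nt")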

2
Prakki Rama2.3k wrote (4.1 years ago, Singapore):

One can also assess the number of misassembly errors in the genome using tools like REAPR and misSEQuel. That also gives a nice gauge of how good the assembled genome is.

1
Ketil4.0k wrote (7.2 years ago, Germany):

I also wrote a pipeline for assessing de novo assemblies. It's not particularly strong in the plotting department, but it will use a variety of data (454, Illumina, DNA-seq, RNA-seq, ESTs, proteomes, etc.) and calculate a bunch of numbers, in addition to internal metrics like N50 sizes and nucleotide counts, that let you compare your candidate drafts. More info at http://blog.malde.org/posts/assembly-evaluation.html


The figures are very appealing. I sweated blood trying to install Haskell and the dependencies, without success. In the end, I was unable to use your pipeline.

modified 4.2 years ago • written 4.5 years ago by Prakki Rama2.3k

I think I've found all the main dependencies in conda repositories (you can search for them on anaconda.org), so it should be as simple as a few conda commands. And if you don't use conda already (especially with the Bioconda channel), you should start right now :p

PS: I also found Haskell in brew, but I don't know how to easily search for the other packages (I don't have brew on my system, to avoid PATH collisions with conda, so I can't just run "brew search").

written 11 months ago by cicindel10
1
Corentin430 wrote (14 months ago):

This is an old topic, but here is a list of the tools I currently use:

Moreover, the very interesting Assemblathon 2 paper (https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10) describes how they assessed the different assemblies.

1

I like the k-mer approach; as an aside, it can be interesting to see how much intersection you get in the k-mer spectrum between the paired-end reads and the final assemblies. Comparing the distance between PE/assembler1 and PE/assembler2, or between the contigs from assembly1 and assembly2, can be quite interesting...

written 6 months ago by ctseto250

Yes, it is a very useful approach when no reference is available. It can also be used to estimate the level of misassembly, by counting the distinct k-mers found only in the assembly (and not in the reads).
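Both checks can be illustrated with plain Python sets. This toy sketch will not scale to a real read set (a dedicated k-mer tool such as KAT does this properly), and the filenames and k=21 are assumptions:

    from itertools import islice

    K = 21

    def canonical(kmer):
        rc = kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
        return min(kmer, rc)

    def kmers(seq, k=K):
        seq = seq.upper()
        return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)
                if "N" not in seq[i:i + k]}

    def fasta_kmers(path):
        ks, seq = set(), []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    ks |= kmers("".join(seq))
                    seq = []
                else:
                    seq.append(line.strip())
        ks |= kmers("".join(seq))
        return ks

    def fastq_kmers(path):
        ks = set()
        with open(path) as fh:
            while True:
                record = list(islice(fh, 4))      # header, sequence, '+', qualities
                if len(record) < 4:
                    break
                ks |= kmers(record[1].strip())
        return ks

    asm, reads = fasta_kmers("assembly.fasta"), fastq_kmers("reads.fastq")
    print("k-mers shared by reads and assembly:", len(asm & reads))
    print("k-mers found only in the assembly:", len(asm - reads))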

written 5 months ago by Corentin430
0
harish230 wrote (15 months ago):

A couple of other parameters to judge by:

  1. Mapping RNA-seq data, either from a closely related species or your own.
  2. Duplicated contigs in your assembly. I have seen this in several genomes where the unique genomic content is very low (a quick check for exact duplicates is sketched below).
  3. If you can predict ORFs, one of the better approaches is to annotate them against UniProt or with InterProScan and see how many ORFs get annotated. The number should be fairly close to that of the most closely related organism. Generally speaking, most plants have about 25,000-30,000 genes, and most bacteria have around 1,000 genes per Mb.
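For point 2, a minimal way to flag the trivial cases: hash each contig (and its reverse complement) and report exact duplicates. Near-identical duplicates need an aligner such as nucmer or minimap2; the filename here is a placeholder.

    import hashlib
    from collections import defaultdict

    def read_fasta(path):
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq).upper()
                    name, seq = line[1:].split()[0], []
                else:
                    seq.append(line.strip())
        if name is not None:
            yield name, "".join(seq).upper()

    def canonical_digest(seq):
        """Same digest for a sequence and its reverse complement."""
        rc = seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]
        return hashlib.sha1(min(seq, rc).encode()).hexdigest()

    groups = defaultdict(list)
    for name, seq in read_fasta("assembly.fasta"):
        groups[canonical_digest(seq)].append(name)

    for names in groups.values():
        if len(names) > 1:
            print("exact duplicate contigs:", ", ".join(names))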