Assessing The Quality Of De Novo Assembled Data
10
25
10.5 years ago
Prakki Rama ★ 2.6k

Are there any other ways of assessing the quality of assembled data obtained from different assemblers apart from metrics like N50 and assembly size?

I know of a few, such as:

• BLASTing the contigs and checking the percentage that return hits.
• Checking the percentage of reads that map back to the contigs of each assembly.

Any other ideas are appreciated.

assembly next-gen • 20k views
20
10.0 years ago
Nikolay Vyahhi ★ 1.3k

QUAST (QUality ASsessment Tool for Genome Assemblies) can be used to assess the quality of genome assemblies, both de novo and reference-based.

0

I also highly recommend QUAST for these tasks.

13
10.5 years ago
Markf ▴ 290

Not to beat my own drum (too much), but I've written a tool that can be useful for this. The idea is that you map all raw reads back to the assembled genome and then assess which read pairs map and, more importantly, which pairs map at an unexpected distance. The tool takes a BAM file as input, processes it, and then lets you generate plots.

cheers Mark

0

I was just looking for a coverage-plotting library! Great coincidence, thanks!!!

0

If you need help or advice, let me know. If you need fixes or features, you can add them to the issue list on GitHub.

0

Interesting. But what if the paired-end reads are already mated?

0

What do you mean by "already mated"?

If you align paired-end reads to your assembly, the observed insert size shouldn't be much larger or smaller than the library's expected insert size. If it is too large, that indicates your assembly includes sequence that does not exist. If it is too small, that indicates your assembly is missing a region. If a region in the assembly is not bridged by any read pairs, that is an indication that the region doesn't exist in reality.
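A rough sketch of that check in Python, assuming you have already pulled the insert sizes out of the alignments yourself (for example the absolute TLEN values of properly oriented pairs in the BAM file). The function name `flag_insert_sizes`, the cutoff `n_mads` and the example numbers are all made up for illustration; this is not how any particular tool does it.

```python
from statistics import median


def flag_insert_sizes(insert_sizes, n_mads=5.0):
    """Classify insert sizes relative to the library median.

    Returns a dict with the indices of pairs whose insert size is
    'too_small', 'ok' or 'too_large'. The spread is estimated with the
    median absolute deviation (MAD), which is robust to outliers.
    """
    med = median(insert_sizes)
    mad = median([abs(size - med) for size in insert_sizes])
    spread = max(mad, 1)  # avoid a zero-width window when all sizes agree
    low, high = med - n_mads * spread, med + n_mads * spread

    flags = {"too_small": [], "ok": [], "too_large": []}
    for index, size in enumerate(insert_sizes):
        if size < low:
            # pair maps closer than expected: assembly may be missing sequence
            flags["too_small"].append(index)
        elif size > high:
            # pair maps further apart than expected: assembly may contain extra sequence
            flags["too_large"].append(index)
        else:
            flags["ok"].append(index)
    return flags


# Hypothetical insert sizes for a library of roughly 300 bp
sizes = [300, 310, 295, 305, 290, 1500, 40, 302]
flags = flag_insert_sizes(sizes)
```

In practice you would look for places where many flagged pairs pile up at the same position rather than at single pairs, since individual outliers are common.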

0

I meant (pre-)assembly of the two reads that belong to each pair (if the combination of read length and insert size allows it), as done by tools like FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml. In that case, relying on a tool that studies the mapping of the pairs would be useless.

5
9.8 years ago
Lee Katz ★ 3.1k

Old topic, but this was just published. I'm curious how well it performs and will hopefully be testing it myself this week or soon (whenever time permits)

http://www.ncbi.nlm.nih.gov/pubmed/23303509

Abstract
MOTIVATION:
Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.
RESULTS:
In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

5
7.3 years ago
Prakki Rama ★ 2.6k

One more for the list: Benchmarking Universal Single-Copy Orthologs (BUSCO for short), which assesses genome assembly and annotation completeness. It replaces the discontinued CEGMA.

4
10.0 years ago

I don't have experience with any tools that estimate quality based on re-mapping reads to the de novo assembled sequence, and I'll have to check some of these out. I typically use the following metrics to compare the relative quality of my genome assemblies.

• N50 and N90
• number of contigs or scaffolds
• length of the longest contig or scaffold
• combined length of all contigs or scaffolds
• % CEGs (conserved core eukaryotic genes) mapped

For this last one, I use the CEGMA method to identify genes that are highly conserved among all eukaryotes (implementation available at http://korflab.ucdavis.edu/datasets/cegma). The more of these conserved genes CEGMA is able to identify, the more confidence I have in the quality of the assembly and in my ability to accurately annotate other genes in that genome.
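The length-based metrics in the list above are easy to compute by hand if you'd rather not depend on a tool. A minimal Python sketch, assuming you already have the contig (or scaffold) lengths as a list of integers; the function names and the example lengths are made up:

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length: the length L such that contigs of length >= L
    together cover at least `fraction` of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


def summarize(lengths):
    """Collect the basic length-based assembly metrics in one dict."""
    return {
        "n_contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
    }


# Hypothetical contig lengths, in bp
contig_lengths = [100, 80, 60, 40, 20]
stats = summarize(contig_lengths)
print(stats)
```

These numbers only describe contiguity, so they are best used to compare assemblies of the same data set rather than as an absolute measure of correctness.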

0

Can you please elaborate on your answer? I would like to know how you draw inferences from the comparison, i.e. how you judge whether an assembly is good or poor. Even a brief explanation would be appreciated, as I am trying to avoid tools.

3
10.0 years ago

In addition to the others mentioned, I think that an ORF prediction step can give you a strong and fast way to compare several assemblies.

2
7.1 years ago
Prakki Rama ★ 2.6k

One can also assess the number of misassembly errors in the genome using tools like REAPR and misSEQuel. That also gives a nice gauge of how good the assembled genome is.

2
4.3 years ago
Corentin ▴ 490

This is an old topic, but here is a list of the tools I currently use:

Moreover, the very interesting paper for Assemblathon 2 (https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10) describes how they assessed the different assemblies.

1

I like the k-mer approach; as an aside, it can be interesting to see how much intersect() you get in the k-mer spectrum between the paired-end reads and the final assemblies. Comparing the distance between PE/assembler1 and PE/assembler2, or between the contigs from assembly1 and assembly2, can be quite interesting...

0

Yes, it is a very useful approach when no reference is available. It can also be used to estimate the level of mis-assemblies, by calculating the number of distinct k-mers only found in the assembly (and not in the reads).
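For anyone who wants to try the idea on a small data set, here is a naive Python sketch of the "k-mers only found in the assembly" count. It holds everything in memory as sets, so it is only meant to show the logic; for real data you would use a dedicated k-mer counter. The function names are made up, and k-mers are treated as strand-independent (canonical), which is my assumption rather than something stated above.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(sequences, k):
    """Set of canonical k-mers (the smaller of a k-mer and its reverse
    complement) over all sequences, skipping k-mers with non-ACGT bases."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if set(kmer) <= set("ACGT"):
                kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def assembly_only_fraction(reads, contigs, k):
    """Fraction of distinct assembly k-mers that never occur in the reads."""
    read_kmers = canonical_kmers(reads, k)
    assembly_kmers = canonical_kmers(contigs, k)
    if not assembly_kmers:
        return 0.0
    return len(assembly_kmers - read_kmers) / len(assembly_kmers)


# Toy example: half of the contig's 3-mers are not supported by the read
fraction = assembly_only_fraction(reads=["AAACCC"], contigs=["AAAGGG"], k=3)
```

A high fraction suggests the assembler produced sequence the reads do not support. Note that sequencing errors in the reads inflate the read k-mer set, so filtering out low-count read k-mers first would make the check stricter.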

1
10.2 years ago
Ketil 4.1k

I also wrote a pipeline to assess de novo assemblies. It's not particularly strong in the plotting department, but it will use a variety of data (454, Illumina, DNAseq, RNAseq, ESTs, proteomes, etc.) and calculate a bunch of numbers, in addition to internal metrics like N50 sizes and nucleotide counts, that let you compare your candidate drafts. More info at http://blog.malde.org/posts/assembly-evaluation.html

0

The figures are very appealing. I sweated blood trying to install Haskell and the dependencies, without success, so in the end I was unable to use your pipeline.

0

I think I've found all the main dependencies in conda repos (you can search for them on anaconda.org), so it should be as simple as a few conda commands. And if you don't use conda already (especially with the Bioconda channel), you should all start right now :p

PS: I also found Haskell in brew, but I don't know how to easily search for the other packages (I don't have brew on my system, to avoid PATH collisions with conda, so I can't just "brew search").

0
4.3 years ago
harish ▴ 430

A few other parameters to judge by:

1. Mapping RNA-seq data from a closely related species, or your own, to the assembly.
2. Duplicated contigs in your assembly. I have seen this in the case of multiple genomes where the unique genomic content is very low.
3. If you can predict ORFs, one of the better approaches is to annotate them against UniProt or with InterProScan and see how many ORFs get annotated. The number should be pretty close to that of the closest organism of your choice. Generally speaking, most plants have about 25,000-30,000 genes, and most bacteria have around 1,000 genes per Mb.
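For point 2, a quick first pass is to look for contigs that are exact copies of each other on either strand. The Python sketch below does only that; it will not catch near-duplicates or contigs contained inside longer ones, for which you would need an aligner. The function names and the toy contigs are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string (other letters, such
    as N, are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_duplicate_contigs(contigs):
    """Group contig names whose sequences are identical on either strand.

    `contigs` maps contig name -> sequence. Returns a list of name groups,
    one per duplicated sequence.
    """
    groups = {}
    for name, seq in contigs.items():
        seq = seq.upper()
        # the same key is produced for a sequence and its reverse complement
        key = min(seq, revcomp(seq))
        groups.setdefault(key, []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy example: c2 is the reverse complement of c1
contigs = {"c1": "AAACCC", "c2": "GGGTTT", "c3": "ACGTTA"}
duplicates = find_duplicate_contigs(contigs)
```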