Assessing The Quality Of De Novo Assembled Data
10
25
11.8 years ago
Prakki Rama ★ 2.7k

Are there any other ways of assessing the quality of assembled data obtained from different assemblers apart from metrics like N50 and assembly size?

I know of a few:

  • BLAST the contigs and check the percentage of BLAST hits.
  • Check the percentage of reads that map back to the contigs of each assembly.

Any other ideas are appreciated.

assembly next-gen • 23k views
20
11.4 years ago
Nikolay Vyahhi ★ 1.3k

QUAST (QUality ASsessment Tool for Genome Assemblies) can be used to assess the quality of genome assemblies, both de novo and reference-based.

0

I also highly recommend QUAST for these tasks.

13
11.8 years ago
Markf ▴ 290

Not to beat my own drum (too much), but I've written a tool that can be useful for this. The idea is that you map all raw reads back to the assembled genome and then assess which read pairs map and, more importantly, which map at an unexpected distance. The tool takes a BAM input file, processes it, and then allows you to generate plots.

See: https://github.com/mfiers/hagfish

cheers Mark

0

I was just looking for a coverage-plotting library! Great coincidence, thanks!!!

0

If you need help or advice, let me know. If you need fixes or features, you can add them to the issue list on GitHub.

0

Interesting. But what if the paired-end reads are already mated?

0

What do you mean by "already mated"?

If you align paired-end reads to your assembly, the insert size shouldn't be too large or too small. If the size is too large, that indicates that your assembly includes regions that do not exist. If the size is too small, that indicates that your assembly is missing a region. If a region in the assembly is not bridged by any read pairs, that is an indication that the region doesn't exist in reality.
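As a rough illustration of the insert-size check described above, here is a minimal Python sketch. It assumes the insert sizes have already been pulled out of the alignments (for example from the template-length column of a SAM/BAM file). The function name `classify_inserts` and the cutoff of 3 median absolute deviations are illustrative assumptions, not something hagfish or any other tool in this thread prescribes.

```python
from statistics import median

def classify_inserts(insert_sizes, n_mads=3.0):
    """Label each insert size as 'ok', 'too_small' or 'too_large'.

    insert_sizes: observed insert sizes (bp) of mapped read pairs.
    n_mads: hypothetical cutoff, in median absolute deviations (MAD).
    """
    if not insert_sizes:
        return []
    med = median(insert_sizes)
    # MAD is robust to the very outliers we are trying to detect.
    # Fall back to 1.0 if all sizes are identical (MAD of zero).
    mad = median(abs(x - med) for x in insert_sizes) or 1.0
    labels = []
    for size in insert_sizes:
        if size > med + n_mads * mad:
            labels.append("too_large")   # assembly may contain extra sequence
        elif size < med - n_mads * mad:
            labels.append("too_small")   # assembly may be missing sequence
        else:
            labels.append("ok")
    return labels

sizes = [300, 310, 295, 305, 290, 1200, 40]
print(list(zip(sizes, classify_inserts(sizes))))
```

On real data you would look for clusters of flagged pairs at the same assembly position rather than at individual outliers, since a single odd pair is more likely a mapping artefact than a misassembly.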

0

I meant (pre-)assembly of the two reads that belong to each pair (if the combination of read and insert sizes allows it), as tools like FLASH (http://genomics.jhu.edu/software/FLASH/index.shtml) do. In that case, relying on a tool that studies the mapping of the pairs would be useless.

5
11.2 years ago
Lee Katz ★ 3.1k

Old topic, but this was just published. I'm curious how well it performs and will hopefully be testing it myself this week or soon (whenever time permits)

http://www.ncbi.nlm.nih.gov/pubmed/23303509

http://sc932.github.com/ALE/about.html

Abstract
MOTIVATION:
Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.
RESULTS:
In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
5
8.6 years ago
Prakki Rama ★ 2.7k

One more for the list: BUSCO (Benchmarking Universal Single-Copy Orthologs), for assessing genome assembly and annotation completeness. It replaces the discontinued CEGMA.

4
11.4 years ago

I don't have experience with any tools that estimate quality based on re-mapping reads to the de novo assembled sequence, and I'll have to check some of these out. I typically use the following metrics to compare the relative quality of my genome assemblies.

  • N50 and N90
  • number of contigs or scaffolds
  • length of the longest contig or scaffold
  • combined length of all contigs or scaffolds
  • % CEGs (conserved core eukaryotic genes) mapped

For this last one, I use the CEGMA method [1] to identify genes that are highly conserved among all eukaryotes (implementation available at http://korflab.ucdavis.edu/datasets/cegma). The more of these conserved genes CEGMA is able to identify, the more confidence I have in the quality of the assembly and my ability to accurately annotate other genes in that genome.


  1. Parra G, Bradnam K, Korf I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23: 1061-1067, doi:10.1093/bioinformatics/btm071.
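
The length-based metrics in the list above are simple enough to compute by hand. Here is a minimal Python sketch, assuming you already have the contig (or scaffold) lengths as a list of integers; the function names `nx` and `summarize` are made up for this example.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx value: the length L such that contigs of length >= L
    together contain at least `fraction` of all assembled bases.
    fraction=0.5 gives N50, fraction=0.9 gives N90."""
    if not lengths:
        raise ValueError("no contig lengths given")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest contig down until the target is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length

def summarize(lengths):
    """Collect the basic contiguity metrics for one assembly."""
    return {
        "n_contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
    }

contig_lengths = [100, 80, 60, 40, 20]
print(summarize(contig_lengths))
```

These numbers only describe contiguity, so two assemblies with the same N50 can still differ a lot in correctness. That is why they are best read alongside a completeness measure such as the conserved-gene count.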
0

Could you please elaborate on your answer? I would like to know how you draw inferences from the comparison, that is, how you judge whether it is a good or a poor alignment. Even a brief explanation would be appreciated, as I am trying to avoid tools.

3
11.4 years ago

In addition to the others mentioned, I think an ORF prediction step can give you a strong and fast way to compare several assemblies.

2
8.5 years ago
Prakki Rama ★ 2.7k

One can also assess the number of misassembly errors in the genome using tools like REAPR and misSEQuel. That also gives a nice gauge of how good the assembled genome is.

2
5.6 years ago
Corentin ▴ 600

This is an old topic, but here is a list of the tools I currently use:

Moreover, the very interesting Assemblathon 2 paper (https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10) describes how they assessed the different assemblies.

1

I like the k-mer approach. As an aside, it can be interesting to see how much intersect() you get in the k-mer spectrum between the paired-end reads and the final assemblies. Comparing the distance between PE/assembler1 and PE/assembler2, or between the contigs from assembly1 and assembly2, can be quite interesting...

0

Yes, it is a very useful approach when no reference is available. It can also be used to estimate the level of mis-assemblies, by calculating the number of distinct k-mers only found in the assembly (and not in the reads).
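
To make the idea concrete, here is a minimal Python sketch that collects the distinct k-mers found in the assembly but not in the reads. It is only a toy under stated assumptions: it holds everything in memory, ignores sequencing errors and coverage, and the function names `canonical_kmers` and `assembly_only_kmers` are invented for this example. For real data you would use a dedicated k-mer counter instead.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")

def canonical_kmers(seq, k):
    """Return the set of canonical k-mers in seq. A k-mer and its reverse
    complement are treated as the same k-mer (strand-independent)."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _BASES:  # skip k-mers containing N or other symbols
            revcomp = kmer.translate(_COMPLEMENT)[::-1]
            found.add(min(kmer, revcomp))
    return found

def assembly_only_kmers(contigs, reads, k):
    """Distinct k-mers present in the contigs but absent from the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= canonical_kmers(read, k)
    assembly_kmers = set()
    for contig in contigs:
        assembly_kmers |= canonical_kmers(contig, k)
    return assembly_kmers - read_kmers

reads = ["ACGTAC", "GTACGG"]
supported = assembly_only_kmers(["ACGTACGG"], reads, k=4)
suspicious = assembly_only_kmers(["ACGTTTGG"], reads, k=4)
print(len(supported), len(suspicious))
```

A contig fully supported by the reads yields no assembly-only k-mers, while a contig with sequence the reads never contained yields several, which is the signal being counted.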

1
11.6 years ago
Ketil 4.1k

I also wrote a pipeline to assess de novo assemblies. It's not particularly strong in the plotting department, but it will use a variety of data (454, Illumina, DNAseq, RNAseq, ESTs, proteomes, etc.) and calculate a bunch of numbers, in addition to internal metrics like N50 sizes and nucleotide counts, that let you compare your candidate drafts. More info at http://blog.malde.org/posts/assembly-evaluation.html

0

The figures are very appealing. I sweated blood trying to install Haskell and the dependencies, without success, so in the end I was unable to use your pipeline.

0

I think I've found all the main dependencies in conda repos (you can search for them on anaconda.org), so it should be as simple as a few conda commands. And if you don't use conda already (especially with the Bioconda channel), you should all start right now :p

PS: I also found Haskell in brew, but I don't know how to easily search for the other packages (I don't have brew on my system, to avoid PATH collisions with conda, so I can't just "brew search").

0
5.6 years ago
harish ▴ 450

A couple of other criteria to judge by:

  1. Mapping of data from a closely related species, or of your own RNA-seq data.
  2. Duplicated contigs in your assembly. I have seen this in the case of multiple genomes where the unique genomic content is very low.
  3. If you can predict ORFs, then one of the better approaches is to annotate them against UniProt or with InterProScan to see how many ORFs get annotated. The number should be pretty close to that of the closest organism of your choice. Generally speaking, most plants have about 25,000-30,000 genes, and most bacteria have around 1,000 genes per Mb.
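
For point 2, here is a minimal Python sketch of one way to spot duplicated contigs. It only catches contigs that are exactly identical on either strand, and it assumes the contigs are already loaded into a dict of name to sequence; the function name `find_duplicate_contigs` is hypothetical. Near-duplicates or contained contigs would need an alignment-based comparison instead.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_duplicate_contigs(contigs):
    """Group contig names whose sequences are identical on either strand.

    contigs: dict mapping contig name -> sequence.
    Returns a list of name groups, one per duplicated sequence.
    """
    groups = defaultdict(list)
    for name, seq in contigs.items():
        seq = seq.upper()
        revcomp = seq.translate(_COMPLEMENT)[::-1]
        # Use the lexicographically smaller strand as a strand-neutral key.
        groups[min(seq, revcomp)].append(name)
    return [names for names in groups.values() if len(names) > 1]

contigs = {
    "c1": "ACGTTG",
    "c2": "CAACGT",   # reverse complement of c1
    "c3": "GGGTTT",
}
print(find_duplicate_contigs(contigs))
```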