Question: How To Assess The Quality Of An Assembly? (Is There No Magic Formula?)
diltsjeri (Chicago, IL) asked 5.7 years ago:

Hi,

I'm having a difficult time finding a consensus method for assessing the quality of an assembly.

Are there "best" methods to use depending on the organism type, sequencing technology, and sequence quality? I know N50 is a value I should use to assess assembly quality, but is this the only metric?

Thanks.

Tags: assembly, quality, next-gen • 18k views
modified 11 weeks ago by alslonik • written 5.7 years ago by diltsjeri

'Quality' can be a very subjective thing. The Assemblathons, as well as contests like GAGE and dnGASP, seem to indicate that assemblies can be high quality in a few areas of interest, but it is hard to make an assembly that excels in all aspects of quality. If you are only interested in one aspect of assembly quality, e.g. finding genes in a genome assembly, then it may not matter whether scaffolds are really long (e.g. > 10 Mbp), only that scaffolds mostly contain whole genes.

N50 can tell you something about the average length of scaffolds and/or contigs. It is meaningless to compare the N50 values of any two assemblies unless they are the same size. It is also possible to artificially raise N50 by deliberately excluding short contigs/scaffolds and/or increasing the padding of Ns within scaffolds. One of the figures we include in the Assemblathon 2 paper suggests that N50 can be a semi-useful predictor of assembly quality. Some of the most highly-ranked assemblies had high N50 values...but not all of them did, and some which had high N50 values did not rank as highly.
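
For concreteness, here is a minimal sketch of how N50 is usually computed from a list of contig/scaffold lengths (the numbers are toy values, not from any paper):

```python
def n50(lengths):
    """Return N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: total size 150, half is 75; 50 + 40 = 90 >= 75, so N50 = 40.
print(n50([50, 40, 30, 20, 10]))  # 40
```

Note how excluding the short contigs (e.g. dropping the 10 and 20) shrinks the total and raises the N50, which is exactly the kind of manipulation described above.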

To give you a succinct, but somewhat disappointing, answer to your question, I would say:

There is no magic formula.

modified 5.6 years ago • written 5.6 years ago by kbradnam

Lately I have been following the methods listed here:

  • BUSCO/CEGMA to check for core genes
  • Map RNA-Seq reads and unigenes derived from a transcriptome assembly
  • Map proteins from closely related species
  • Map the constituent reads that were used to build the assembly and check their depth and mappability
  • Distribution of NGx (10, 50, 70, 90, etc.)
  • Distribution of contig lengths
  • Check for duplicate contigs and other contaminants (the easiest way is to submit the genome to NCBI)
  • Base composition of the assembly.
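
NGx, mentioned in the list above, is like Nx but measured against an estimated genome size rather than the assembly size, so it penalizes incomplete assemblies. A minimal sketch (the genome-size figure here is made up for illustration):

```python
def ngx(lengths, genome_size, x):
    """NG(x): length of the contig at which the sorted contigs first
    cover x% of the estimated genome size (not the assembly size)."""
    threshold = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # assembly too small to reach x% of the genome

# Toy assembly of 150 bp against an assumed 200 bp genome.
lengths = [50, 40, 30, 20, 10]
print([(x, ngx(lengths, 200, x)) for x in (10, 50, 70, 90)])
# [(10, 50), (50, 30), (70, 20), (90, 0)]
```

NG90 comes out as 0 here because the assembly covers only 75% of the assumed genome, which is itself a useful signal.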
modified 11 weeks ago • written 11 weeks ago by harishk0201
zam.iqbal.genome (United Kingdom) answered 5.7 years ago:

N50 is most definitely not the only thing to look at. How you should assess an assembly basically depends on what you want to do with it.

You could check out this paper, recently submitted to the arXiv:

http://arxiv.org/pdf/1301.5406

"Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species"

Keith R. Bradnam (1), Joseph N. Fass (1), Anton Alexandrov (36), Paul Baranay (2), Michael Bechner (39), İnanç Birol (33), Sébastien Boisvert (10,11), Jarrod A. Chapman (20), Guillaume Chapuis (7,9), Rayan Chikhi (7,9), Hamidreza Chitsaz (6), Wen-Chi Chou (14,16), Jacques Corbeil (10,13), Cristian Del Fabbro (17), T. Roderick Docking (33), Richard Durbin (34), Dent Earl (40), Scott Emrich (3), Pavel Fedotov (36), Nuno A. Fonseca (30,35), Ganeshkumar Ganapathy (38), Richard A. Gibbs (32), Sante Gnerre (22), Élénie Godzaridis (11), Steve Goldstein (39), Matthias Haimel (30), Giles Hall (22), David Haussler (40), Joseph B. Hiatt (41), Isaac Y. Ho (20), Jason Howard (38), Martin Hunt (34), Shaun D. Jackman (33), David B. Jaffe (22), Erich Jarvis (38), Huaiyang Jiang (32), et al. (55 additional authors not shown)

and also the previous Assemblathon paper. Also check out papers by Steven Salzberg and Mihai Pop on this subject, plus the references within all of the above. There are many others that I can't think of off the top of my head; I'm sure others will suggest some.

best Zam

written 5.7 years ago by zam.iqbal.genome

As you mentioned GAGE, I am actually concerned with this evaluation. For small genomes, the authors intentionally mix 50% short-insert reads and 50% long-insert reads by thinning the source data. When assembling, they largely treat the two types of reads the same, apart from orientation and insert size. If the assembler does not account for the exceptionally high chimeric rate of long-insert reads, its performance will be very bad, as shown in the table. In practice, however, short-insert reads are cheaper and of much better quality than long-insert reads. A better approach would be to sequence more short-insert reads, assemble them first, and then use long-insert reads only to build scaffolds. As such, GAGE might only be evaluating a scenario that does not represent best practice.

Assemblathon 1/2, on the other hand, is truly amazing; I like it a lot.

written 5.7 years ago by lh3

Hi Heng. I didn't mention GAGE at all, I mentioned Steven Salzberg. I was thinking of papers like these

http://bioinformatics.oxfordjournals.org/content/21/24/4320.full

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0021400

http://books.google.co.uk/books?hl=en&lr=&id=UrKGLrmpRZAC&oi=fnd&pg=PA163&dq=info:vd-c54xEXwAJ:scholar.google.com&ots=tIw9P31XE9&sig=0L3wqezpwzlFJtcH28HEunu_ZHc&redir_esc=y#v=onepage&q&f=false

http://www.biomedcentral.com/content/pdf/gb-2008-9-3-r55.pdf

cheers

Zam

written 5.7 years ago by zam.iqbal.genome

Yeah, their reviews are very good. Thanks for the clarification.

written 5.7 years ago by lh3
Madelaine Gogol (Kansas City) answered 5.7 years ago:

I like the paper answer above, but if you're just looking for some additional measuring sticks besides N50, you could also think about:

  • Number of contigs
  • Length of the longest/shortest contigs
  • Average contig length
  • Total length of all contigs
  • Combined length of the 10/100/1000/10000 longest contigs
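
All of these measuring sticks are cheap to compute from the contig lengths alone; a minimal sketch with toy numbers:

```python
def contig_stats(lengths):
    """Basic length statistics for a list of contig lengths."""
    lengths = sorted(lengths, reverse=True)
    return {
        "count": len(lengths),
        "longest": lengths[0],
        "shortest": lengths[-1],
        "mean": sum(lengths) / len(lengths),
        "total": sum(lengths),
        "top10_total": sum(lengths[:10]),  # combined length of 10 longest
    }

print(contig_stats([50, 40, 30, 20, 10]))
```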
modified 5.7 years ago • written 5.7 years ago by Madelaine Gogol
Manu Prestat (Marseille, France) answered 5.7 years ago:

I would add: the number of annotations you can grab from your contigs, or the number of ORFs you can predict, as "information content" estimates.

written 5.7 years ago by Manu Prestat
earonesty (United States) answered 5.7 years ago:

I use a dup-mer-21 calculation to compare assemblies, based on this conversation:

http://www.homolog.us/blogs/2012/06/26/what-is-wrong-with-n50-how-can-we-make-it-better-part-ii/

Source code:

http://ea-utils.googlecode.com/svn/trunk/clipper/contig-stats

This lets you know if there is excessive chimerism ... a common error.
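
The exact calculation lives in the contig-stats script linked above; the underlying idea, counting k-mers that occur in more than one place across the assembly, can be sketched roughly like this (my paraphrase of the idea, not the script's actual code):

```python
from collections import Counter

def dup_kmer_fraction(contigs, k=21):
    """Fraction of distinct k-mers that occur more than once across
    the assembly.  An elevated value can flag sequence assembled into
    multiple places, e.g. from overaggressive or chimeric joins."""
    counts = Counter()
    for seq in contigs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    if not counts:
        return 0.0
    dup = sum(1 for c in counts.values() if c > 1)
    return dup / len(counts)

# Tiny example with k=4: "ACGT" appears twice among the 4 distinct 4-mers.
print(dup_kmer_fraction(["ACGTACGT"], k=4))  # 0.25
```

A real implementation would also canonicalize reverse complements and use a probabilistic counter to keep memory bounded on genome-scale input.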

written 5.7 years ago by earonesty

The article correctly points out that evaluating N50 only is frequently misleading, but the last paragraph is questionable. When there is ambiguity about whether A should be connected to B or to C, the right decision is not to perform any joining. If we force a join, we will get longer N50 at the cost of high error probability at the junction. An aggressive assembler will get longer N50 but more misassemblies in that case.

written 5.7 years ago by lh3

Which is what dup-mer-21 will detect: overaggressive assemblers. You should see the same k-mer represented in multiple locations when the assembler calls connections in its graph more aggressively than it should.

It's easy to produce a single contig. It's hard to get it right.

modified 5.0 years ago • written 5.5 years ago by earonesty

@earonesty: Could I please know how to interpret dup-mer-cnt and dup-pct-21 when comparing assemblies? Should they be high or low?

written 4.7 years ago by Prakki Rama

They should be "comparable to expected". In other words, you should benchmark against an existing quality assembly. Some k-mer duplication is, of course, expected. What the "correct" number is varies from organism to organism. As a rule, I would expect a longer genome to have more.

modified 4.3 years ago • written 4.3 years ago by earonesty
Rayan Chikhi (France, Lille, CNRS) answered 5.7 years ago:

QUAST and FRCurve are two recent tools that should definitely be considered when evaluating assemblies.

QUAST computes a comprehensive set of classical metrics. It can reproduce the GAGE benchmark.

FRCurve computes newer metrics related to correctness.

modified 5.7 years ago • written 5.7 years ago by Rayan Chikhi
SES (Vancouver, BC) answered 5.7 years ago:

Regardless of your biological question, I think looking at length statistics alone can be very misleading and uninformative because 1) the percentage of Ns in scaffolds may be very high and 2) in my experience there is always some level of contamination (from organelles, but possibly also other species) in draft shotgun assemblies. How you define "quality" is important to your assessment of the assembly, but the common goal is to try to represent the actual genomic sequence of an organism, so some things to check are:

  • Sequence content of contigs/scaffolds.
  • Levels of contamination (aside from sequence contamination, there are also assembly artifacts to be aware of, as others mentioned).
  • Gene content/accuracy.

The last two points can be assessed by looking at the reference genome or gene models, respectively, of your species or a closely related species. There are many recent papers on comparing genome assemblies so I won't list any paper or tools (too easy to google), but I will mention a method for inferring the gene content. CEGMA is a set of conserved genes in eukaryotes and may be biologically informative, especially if your organism is a non-model species and you have no transcriptome or even closely related species for comparison.
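
The N content mentioned in point 1 is trivial to check directly from the scaffold sequences; a minimal sketch:

```python
def n_content(scaffolds):
    """Percentage of ambiguous (N) bases across all scaffold sequences."""
    total = sum(len(s) for s in scaffolds)
    ns = sum(s.upper().count("N") for s in scaffolds)
    return 100.0 * ns / total if total else 0.0

# 4 Ns out of 16 bases -> 25% of this toy "assembly" is gap padding.
print(n_content(["ACGTNNNNACGT", "ACGT"]))  # 25.0
```

A length statistic like N50 computed on such scaffolds would silently count the N padding as assembled sequence, which is why this check matters.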

written 5.7 years ago by SES
Prakki Rama (Singapore) answered 4.9 years ago:

You can also check this thread: Assessing The Quality Of De Novo Assembled Data

written 4.9 years ago by Prakki Rama
Lakshman Teja (Bangalore, India) answered 4 months ago:

You can use QUAST (QUality ASsessment Tool), which evaluates genome assemblies by computing various metrics, including:

  1. N50: the length for which the collection of all contigs of that length or longer covers at least 50% of the assembly length
  2. L50: the minimum number X such that the X longest contigs cover at least 50% of the assembly
  3. NG50: like N50, but relative to the reference genome length rather than the assembly length
  4. NA50 and NGA50: like N50 and NG50, but computed on aligned blocks instead of whole contigs
  5. Number of N's per 100 kbp and GC %
  6. Misassemblies: misassembled and unaligned contigs or contig bases
  7. Genes and operons covered
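
To make the N50/L50 distinction concrete, L50 can be computed directly from contig lengths (toy numbers for illustration):

```python
def l50(lengths):
    """L50: the minimum number of contigs whose summed length
    reaches at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for i, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return i
    return 0

# Total 150, half is 75; the two longest contigs (50 + 40 = 90) suffice.
print(l50([50, 40, 30, 20, 10]))  # 2
```

So for this toy assembly N50 is a length (40) while L50 is a count (2): a low L50 and a high N50 both indicate a more contiguous assembly.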

A clear report will be generated, which helps you assess your genome assembly.

Good Luck

written 4 months ago by Lakshman Teja
alslonik (Israel) answered 11 weeks ago:

We also use BUSCO (https://busco.ezlab.org/), along with the already-mentioned QUAST and statistics such as scaffold sizes, percentage of gaps, N50, etc.

written 11 weeks ago by alslonik
Powered by Biostar version 2.3.0