Question: How To Assess The Quality Of An Assembly? (Is There No Magic Formula?)
19
4.2 years ago by
diltsjeri390
Richmond, VA
diltsjeri390 wrote:

Hi,

I'm having a difficult time finding a consensus method for assessing the quality of an assembly.

Are there "best" methods to use based on the organism type, technology, and sequence quality? I know N50 is a value I should use to assess assembly quality, but is it the only metric?

Thanks.

assembly quality next-gen • 11k views
ADD COMMENTlink modified 3.5 years ago by Prakki Rama1.9k • written 4.2 years ago by diltsjeri390
2

'Quality' can be a very subjective thing. The Assemblathons, as well as contests like GAGE and dnGASP, seem to indicate that assemblies can be high quality in a few areas of interest, but it is hard to make an assembly that excels in all aspects of quality. If you are only interested in one aspect of assembly quality, e.g. finding genes in a genome assembly, then it may not matter whether scaffolds are really long (e.g. > 10 Mbp), only that scaffolds mostly contain whole genes.

N50 can tell you something about the average length of scaffolds and/or contigs. It is meaningless to compare the N50 values of any two assemblies unless they are the same size. It is also possible to artificially raise N50 by deliberately excluding short contigs/scaffolds and/or increasing the padding of Ns within scaffolds. One of the figures we include in the Assemblathon 2 paper suggests that N50 can be a semi-useful predictor of assembly quality. Some of the most highly-ranked assemblies had high N50 values...but not all of them did, and some which had high N50 values did not rank as highly.
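To make the point about inflating N50 concrete, here is a minimal sketch in Python. The contig lengths are invented for illustration, and `n50` is just a helper name chosen for this example, not a function from any particular tool.

```python
def n50(lengths):
    """Return N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Hypothetical contig lengths (bp), invented for illustration only.
contigs = [100, 60, 50, 40, 30, 20, 20, 10, 10, 10]
full = n50(contigs)
# The same assembly after discarding every contig shorter than 40 bp.
filtered = n50([c for c in contigs if c >= 40])
print(full, filtered)  # 50 60
```

Nothing about the underlying assembly changed, yet N50 went up simply because the total length went down. This is why N50 values should only be compared between assemblies of the same size.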

To give you a succinct, but somewhat disappointing, answer to your question, I would say:

There is no magic formula.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by kbradnam20
18
4.2 years ago by
United Kingdom
zam.iqbal.genome1.5k wrote:

N50 is most definitely not the only thing to look at. How you should assess an assembly depends on what you want to do with it.

You could check out this paper recently submitted to the arXiv:

http://arxiv.org/pdf/1301.5406

"Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species"

Keith R. Bradnam (1), Joseph N. Fass (1), Anton Alexandrov (36), Paul Baranay (2), Michael Bechner (39), İnanç Birol (33), Sébastien Boisvert (10,11), Jarrod A. Chapman (20), Guillaume Chapuis (7,9), Rayan Chikhi (7,9), Hamidreza Chitsaz (6), Wen-Chi Chou (14,16), Jacques Corbeil (10,13), Cristian Del Fabbro (17), T. Roderick Docking (33), Richard Durbin (34), Dent Earl (40), Scott Emrich (3), Pavel Fedotov (36), Nuno A. Fonseca (30,35), Ganeshkumar Ganapathy (38), Richard A. Gibbs (32), Sante Gnerre (22), Élénie Godzaridis (11), Steve Goldstein (39), Matthias Haimel (30), Giles Hall (22), David Haussler (40), Joseph B. Hiatt (41), Isaac Y. Ho (20), Jason Howard (38), Martin Hunt (34), Shaun D. Jackman (33), David B Jaffe (22), Erich Jarvis (38), Huaiyang Jiang (32), et al. (55 additional authors not shown)

and also the previous Assemblathon paper. Also check out papers by Steven Salzberg and Mihai Pop on this subject, plus the references within all of the above. There are many others which I can't think of off the top of my head; I'm sure others will suggest some.

best Zam

ADD COMMENTlink written 4.2 years ago by zam.iqbal.genome1.5k
3

As you mentioned GAGE, I am actually concerned with this evaluation. For small genomes, the authors intentionally mix 50% short-insert reads and 50% long-insert reads by thinning the source data. When assembling, they largely treat the two types of reads the same apart from orientation and insert size. If the assembler does not consider the exceptionally high chimeric rate of long-insert reads, the performance will be very bad, as is shown in the table. However, in practice, short-insert reads are cheaper and of much better quality than long-insert reads. A better approach would be to sequence more short-insert reads, assemble them first, and then use the long-insert reads only to build scaffolds. As such, GAGE might only be evaluating a scenario that may not represent the best practice.

Assemblathon 1/2 is truly amazing; I like it a lot.

ADD REPLYlink written 4.2 years ago by lh328k

Hi Heng. I didn't mention GAGE at all; I mentioned Steven Salzberg. I was thinking of papers like these:

http://bioinformatics.oxfordjournals.org/content/21/24/4320.full http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0021400

http://books.google.co.uk/books?hl=en&lr=&id=UrKGLrmpRZAC&oi=fnd&pg=PA163&dq=info:vd-c54xEXwAJ:scholar.google.com&ots=tIw9P31XE9&sig=0L3wqezpwzlFJtcH28HEunu_ZHc&redir_esc=y#v=onepage&q&f=false

http://www.biomedcentral.com/content/pdf/gb-2008-9-3-r55.pdf

cheers

Zam

ADD REPLYlink written 4.2 years ago by zam.iqbal.genome1.5k

Yeah, their reviews are very good. Thanks for the clarification.

ADD REPLYlink written 4.2 years ago by lh328k
8
4.2 years ago by
Madelaine Gogol4.8k
Kansas City
Madelaine Gogol4.8k wrote:

I like the paper answer above, but if you're just looking for some additional measuring sticks besides N50, you could also think about:

  • Number of contigs
  • Length of longest/shortest contigs
  • Average length of contigs
  • Total length of all contigs
  • Length of 10/100/1000/10000 longest contigs
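
The measures in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `contig_summary` is a made-up helper name, the lengths are invented, and "length of the N longest contigs" is interpreted here as their combined length.

```python
def contig_summary(lengths, top_n=(10, 100, 1000, 10000)):
    """Summarise a non-empty list of contig lengths (bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    summary = {
        "num_contigs": len(ordered),
        "longest": ordered[0],
        "shortest": ordered[-1],
        "mean_length": total / len(ordered),
        "total_length": total,
    }
    # Assumption: "length of N longest contigs" means their combined length.
    for n in top_n:
        summary[f"top_{n}_length"] = sum(ordered[:n])
    return summary


# Hypothetical contig lengths, invented for illustration only.
example = contig_summary([500, 300, 200, 100, 50], top_n=(2,))
for name, value in example.items():
    print(name, value)
```

In practice the lengths would come from parsing the assembly FASTA; that step is left out to keep the sketch short.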
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Madelaine Gogol4.8k
8
4.2 years ago by
Manu Prestat3.7k
Marseille, France
Manu Prestat3.7k wrote:

I would add: the number of annotations you can grab from your contigs, or the number of ORFs you can predict, as "information content" estimates.

ADD COMMENTlink written 4.2 years ago by Manu Prestat3.7k
6
4.2 years ago by
earonesty190
United States
earonesty190 wrote:

I use a dup-mer-21 calculation to compare assemblies, based on this conversation:

http://www.homolog.us/blogs/2012/06/26/what-is-wrong-with-n50-how-can-we-make-it-better-part-ii/

Source code:

http://ea-utils.googlecode.com/svn/trunk/clipper/contig-stats

This lets you know if there is excessive chimerism ... a common error.

ADD COMMENTlink written 4.2 years ago by earonesty190
1

The article correctly points out that evaluating N50 only is frequently misleading, but the last paragraph is questionable. When there is ambiguity about whether A should be connected to B or to C, the right decision is not to perform any joining. If we force a join, we will get longer N50 at the cost of high error probability at the junction. An aggressive assembler will get longer N50 but more misassemblies in that case.

ADD REPLYlink written 4.2 years ago by lh328k

Which is what the dup-mer-21 will detect... overaggressive assemblers. You should see the same kmer represented in multiple locations when the assembler is more aggressively calling connections in its graph than it should.

It's easy to produce a single contig. It's hard to get it right.
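
As a rough illustration of the idea (not the actual contig-stats code linked above), duplicated k-mers can be counted like this. The function name and the definition used, the fraction of distinct k-mers seen more than once, are assumptions made for this sketch; it also ignores reverse complements and skips k-mers containing N.

```python
from collections import Counter


def duplicated_kmer_fraction(contigs, k=21):
    """Fraction of distinct k-mers that occur more than once across
    all contigs. Assumption: forward strand only, k-mers with N skipped."""
    counts = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    if not counts:
        return 0.0
    duplicated = sum(1 for c in counts.values() if c > 1)
    return duplicated / len(counts)


# Toy sequence with a small k so the result is easy to check by hand:
# the 4-mers are ACGT, CGTA, GTAC, TACG, ACGT, so 1 of 4 distinct is duplicated.
print(duplicated_kmer_fraction(["ACGTACGT"], k=4))  # 0.25
```

For a real assembly this in-memory counter will be slow and memory hungry; a dedicated k-mer counter would be the sensible choice there.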

ADD REPLYlink modified 3.5 years ago • written 4.0 years ago by earonesty190

@earonesty: Could you please explain how to interpret dup-mer-cnt and dup-pct-21 when comparing assemblies? Should they be high or low?

ADD REPLYlink written 3.2 years ago by Prakki Rama1.9k

They should be "comparable to expected". In other words, you should benchmark against an existing high-quality assembly. Some k-mer duplication is, of course, expected. What the "correct" number is varies from organism to organism. As a rule, I would expect longer genomes to have more.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by earonesty190
6
4.2 years ago by
Rayan Chikhi1.1k
France, Lille, CNRS
Rayan Chikhi1.1k wrote:

QUAST and FRCurve are two recent tools that should definitely be considered when evaluating assemblies.

QUAST computes a comprehensive set of classical metrics. It can reproduce the GAGE benchmark.

FRCurve computes newer metrics related to correctness.

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Rayan Chikhi1.1k
3
4.2 years ago by
SES7.8k
Vancouver, BC
SES7.8k wrote:

Regardless of your biological question, I think looking at length statistics alone can be very misleading and uninformative because 1) the percentage of Ns in scaffolds may be very high and 2) there is always some level of contamination (from organelles, but also other species, possibly) in draft shotgun assemblies, in my experience. How you define "quality" is important to your assessment of the assembly, but the common goal is to try and represent the actual genomic sequence of an organism, so some things to check are:

  • Sequence content of contigs/scaffolds.
  • Levels of contamination (aside from sequence contamination, there are also assembly artifacts to be aware of, as others mentioned).
  • Gene content/accuracy.

The last two points can be assessed by looking at the reference genome or gene models, respectively, of your species or a closely related species. There are many recent papers on comparing genome assemblies, so I won't list any papers or tools (too easy to google), but I will mention a method for inferring the gene content. CEGMA uses a set of conserved genes in eukaryotes and may be biologically informative, especially if your organism is a non-model species and you have no transcriptome or even closely related species for comparison.
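
On the first concern, the percentage of Ns in scaffolds, a quick check is easy to write. A minimal sketch in Python, assuming the scaffolds are already loaded as plain strings; `n_fraction` is a name chosen for this example and the sequences are made up.

```python
def n_fraction(scaffolds):
    """Fraction of all scaffold bases that are N (gap padding)."""
    total = sum(len(s) for s in scaffolds)
    if total == 0:
        return 0.0
    n_count = sum(s.upper().count("N") for s in scaffolds)
    return n_count / total


# Hypothetical scaffolds, invented for illustration: 6 Ns out of 20 bases.
toy_scaffolds = ["ACGTNNNNAC", "NNACGTACGT"]
print(round(n_fraction(toy_scaffolds), 3))  # 0.3
```

A high value here means the scaffold length statistics are being carried by gaps rather than by sequence.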

ADD COMMENTlink written 4.2 years ago by SES7.8k
0
3.5 years ago by
Prakki Rama1.9k
Singapore
Prakki Rama1.9k wrote:

You can also check this thread: Assessing The Quality Of De Novo Assembled Data

ADD COMMENTlink written 3.5 years ago by Prakki Rama1.9k