Question

What's the acceptable threshold in Quality Assessment for Genome Assemblies?

4 months ago

JaneZheng0406 • 0

I am using QUAST in KBase to assess the quality of my genome assemblies of bacterial isolates.

The QUAST report provides metrics such as N50 and mismatches. I found their definitions in the manual at https://quast.sourceforge.net/docs/manual.html#sec3.1. I have also learned that an ideal genome assembly is contiguous, complete, and correct.

Most studies suggest that the lower the number of mismatches and similar error metrics, the better the quality of the assembly.

However, are there any absolute values or thresholds that can be used to decide whether an assembly is of good quality?

(Some studies show that the threshold depends on the size of the genome and the goal of the study. If so, is there a way to calculate this threshold?)

Thank you very much!

kbase QUAST quality-control genome-assembly • 696 views

ADD COMMENT • link updated 4 months ago by Ram 43k • written 4 months ago by JaneZheng0406 • 0

score 2 · Answer 1 · 2023-12-10

4 months ago

liorglic ★ 1.4k

I would say that the answer to your question is no: there are no absolute values that would universally be considered good. The contiguity, completeness, and accuracy of an assembly depend heavily on the genome itself (size, complexity, repeat content) and on the data used (depth, short/long reads, read quality).
In very general terms, and by today's standards, I'd say that an assembly with N50 > 1 Mb and a total size covering >95% of the expected genome size is considered fair. But this really depends on the discipline and, more importantly, on the downstream applications of the assembly. I guess others will suggest quite different numbers, though...
If you want more specific answers, you'll need to provide some stats for your input data and resulting assemblies, as well as more details about your project.
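To make the two numbers in that rule of thumb concrete, here is a minimal Python sketch of how N50 and the fraction of the expected genome size can be computed from a list of contig lengths. The function names and the example lengths are made up for illustration; QUAST reports N50 for you, so this only shows what the metric means.

```python
def n50(contig_lengths):
    """Return N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def size_fraction(contig_lengths, expected_genome_size):
    """Return total assembly length divided by the expected genome size."""
    return sum(contig_lengths) / expected_genome_size


# Hypothetical contig lengths in bp, for illustration only.
lengths = [100, 50, 25, 25]
print(n50(lengths))                 # N50 of the toy assembly
print(size_fraction(lengths, 250))  # fraction of a made-up expected size
```

Whether the resulting values are "good" still depends on the genome and the downstream use, as said above.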

Thank you very much for your help!

My data come from whole-genome sequencing of bacterial isolates; the size of the genome is 300 M. The resulting assemblies are around 4-5 Mb, and the number of contigs (N Contigs) ranges from 90 to 100. The assemblies will be used for annotation and pangenome analysis (the KBase pipeline is shown in the image below).

[Image: KBase pipeline]

Please let me know what other information I should provide. Thanks a lot!

ADD REPLY • link 4 months ago by JaneZheng0406 • 0

I'm a bit confused: is the expected genome size 3 Mb? And you're getting assemblies 4-5 Mb in size with an N50 of ~100 kb? Did I understand correctly? (Please edit your reply to make this clear.) Also, you didn't say what sequencing data was used: short or long reads? What sequencing depth? Which assembler did you use?
In any case, if your main goal is gene prediction and pangenomics, then I'd say that an N50 of 100 kb should be good enough, because the vast majority of the genes would not be fragmented. However, my advice would be to look at some publications on similar work and see what statistics they report. This should give you an idea of the standards in your specific discipline.
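As a rough back-of-the-envelope check of the "genes would not be fragmented" point, the sketch below bounds the fraction of genes that could be cut by contig breaks. It assumes about one gene per 1 kb (a common rough figure for bacteria, not something measured here), that each contig break cuts at most one gene, and it uses the approximate numbers from this thread (about 100 contigs, about 4.5 Mb). The function name is made up for illustration.

```python
def max_fragmented_gene_fraction(n_contigs, assembly_size_bp, bp_per_gene=1000):
    """Upper bound on the fraction of genes cut by contig breaks.

    Assumptions (illustrative, not measured):
    - roughly one gene per `bp_per_gene` bases,
    - each contig break interrupts at most one gene.
    """
    estimated_genes = assembly_size_bp / bp_per_gene
    # At most one broken gene per contig; the fraction cannot exceed 1.
    return min(1.0, n_contigs / estimated_genes)


# Approximate values taken from the thread: ~100 contigs, ~4.5 Mb assembly.
fraction = max_fragmented_gene_fraction(100, 4_500_000)
print(f"At most about {fraction:.1%} of genes could be fragmented")
```

Under these assumptions only a few percent of genes could be affected, and the real figure is likely lower, since contig breaks tend to fall in repeats rather than inside ordinary genes.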

ADD REPLY • link 4 months ago by liorglic ★ 1.4k