Question

Understanding MAG completeness and contig composition in metagenomic bins

3 months ago

shevch2009 ▴ 20

Hello everyone,

I have a set of metagenome-assembled genomes (MAGs; I call them bins) with varying completeness levels. The size and number of contigs in each bin differ significantly. For example, one bin has 100% completeness but contains only 23 contigs, while another bin has around 50% completeness but includes about 650 contigs.

Is it correct to understand that these MAGs are essentially collections of contigs, some of which may represent unknown genes? To assess how complete these MAGs are (most of our bins were classified up to the genus level only), should I calculate their average nucleotide identity (ANI) against reference genomes of the corresponding genus? But how can we be sure that the downloaded genomes are themselves complete? In many papers describing novel MAGs, the FASTA files you download are just sets of contigs, which look like bins to me. How can one be confident that such MAGs represent near-complete or high-quality genomes, given that they are fragmented into multiple contigs?

I would appreciate any insights or references on best practices for evaluating MAG completeness and quality beyond just completeness scores.

Thank you!

data shotgun • 741 views

updated 3 months ago by andres.firrincieli 3.9k • written 3 months ago by shevch2009 ▴ 20

score 1 · Answer 1 · 2025-07-16

Is it correct to understand that these MAGs are essentially collections of contigs, some of which may represent unknown genes?

That is correct.

To assess how complete these MAGs are (most of our bins were classified up to the genus level only), should I calculate their average nucleotide identity (ANI) against reference genomes of the corresponding genus?

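There are specific tools for this job. ANI measures how similar a bin is to a reference genome, not how much of the genome was recovered, so it is not a completeness estimate. CheckM2, for instance, estimates the completeness and contamination of bins. The underlying idea, introduced by its predecessor CheckM, is to examine the presence/absence and duplication rate of marker gene sets derived from fully annotated, high-quality genomes; CheckM2 builds on this with machine-learning models trained on such genomes.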
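To make the presence/absence and duplication idea concrete, here is a minimal sketch in Python. It is not how CheckM or CheckM2 are actually implemented (the real tools use lineage-specific marker sets and, for CheckM2, trained models); the function name `estimate_quality` and the four marker names are made up for illustration.

```python
from collections import Counter


def estimate_quality(expected_markers, observed_hits):
    """Toy completeness/contamination estimate, both in percent.

    expected_markers: set of single-copy marker gene IDs expected in the lineage
    observed_hits:    list of marker gene IDs detected in the bin, one entry per hit
    (Both inputs are hypothetical; real tools derive them from HMM searches.)
    """
    if not expected_markers:
        raise ValueError("expected_markers must not be empty")
    # Count how many times each expected marker was found in the bin.
    counts = Counter(h for h in observed_hits if h in expected_markers)
    n = len(expected_markers)
    # Completeness: fraction of expected markers seen at least once.
    completeness = 100.0 * len(counts) / n
    # Contamination: extra copies of markers that should be single-copy.
    contamination = 100.0 * sum(c - 1 for c in counts.values()) / n
    return completeness, contamination


# Illustrative marker set: one marker missing (rpsC), one duplicated (recA).
markers = {"rpoB", "recA", "gyrB", "rpsC"}
hits = ["rpoB", "recA", "recA", "gyrB"]
completeness, contamination = estimate_quality(markers, hits)
print(f"completeness={completeness:.1f}% contamination={contamination:.1f}%")
# prints: completeness=75.0% contamination=25.0%
```

Note that nothing in this calculation depends on the number of contigs, which is why a bin with 650 contigs and a bin with 23 contigs are scored on the same scale.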

How can one be confident that such MAGs represent near-complete or high-quality genomes, given that they are fragmented into multiple contigs?

GTDB selects species representatives based on several quality metrics. For more details, see the section “Updating GTDB species representatives” in https://gtdb.ecogenomic.org/methods

edit: one more thing regarding your last question. If a taxonomic lineage is represented exclusively by MAGs or SAGs (single-amplified genomes), you will never know how much information you are losing until you obtain an isolate of that particular lineage. The criteria used by GTDB are intended for selecting the best representative genome of a taxonomic lineage.
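For metrics beyond completeness, fragmentation can be summarised from the contig lengths alone. The sketch below computes the contig count and N50, plus a commonly used summary score (completeness minus five times contamination). To my knowledge GTDB's genome filtering uses a score of this form, but treat that as an assumption and check the methods page for the exact criteria. The function names and the example numbers are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50: the contig length at which the contigs of that length
    or longer together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def quality_score(completeness, contamination):
    """Summary score: completeness - 5 * contamination (both in percent)."""
    return completeness - 5 * contamination


# Made-up bin: four contigs totalling 1 Mbp, with assumed CheckM2-style estimates.
bin_lengths = [400_000, 300_000, 200_000, 100_000]
print("contigs:", len(bin_lengths))
print("N50:", n50(bin_lengths))
print("quality score:", quality_score(95.0, 2.0))
```

A fragmented bin can still score well here, since contig count and N50 describe contiguity while completeness and contamination describe gene content; it is worth reporting both.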