Question

Average Nucleotide Identity for bins

3 months ago

shevch2009 ▴ 20

Hello all!

I have a shotgun dataset, from which I was able to get some bins with different completeness and contamination rates. Now, I want to calculate Average Nucleotide Identity to see if our MAGs are new species. I am planning to use FastANI. I need reference genomes that will match the MAGs I have.

Most of our bins are assigned (GTDB) to genus/family level, with just a few assigned to species.

I have a question: for example, I have one bin that was assigned to d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Chthoniobacterales;f__UBA10450;g__AV40;s__ ... completeness 100, although it has only 23 contigs.

What I think I need to do is to download all species from the g__AV40 genus, which is available on the GTDB website, but they all have different completeness and it’s not reference data — I mean those bacterial genomes were not from isolates but rather MAGs, and there are no complete genomes available on NCBI if I use the original name of the genus.

So the issue is: I can get those genomes (from GTDB), but they are not really reference genomes.

What should I do in this situation?

Thanks, Best, Alla

ANI bin data shotgun • 749 views

score 1 · Answer 1 · 2025-07-29

What I think I need to do is to download all species from the g__AV40 genus, which is available on the GTDB website, but they all have different completeness and it’s not reference data — I mean those bacterial genomes were not from isolates but rather MAGs, and there are no complete genomes available on NCBI if I use the original name of the genus.

You have already answered your own question: you can't use data that is not available. Many bacterial genera lack even a single complete genome. In fact, the overwhelming majority fall into that category, because isolates with completed genomes are far fewer than MAGs.

I will try to correct a few misconceptions you seem to have. MAG completeness is not estimated from the genome size or the number of contigs. It is estimated from how many of the expected single-copy marker genes are found in the MAG, and contamination from how many of those markers show up more than once. That's why a MAG at 1.8 Mb and 30 contigs and a MAG at 3 Mb and 25 contigs can both be 100% complete with 0% contamination. I picked a random group of several MAGs from my computer to illustrate this.

Name        Completeness (%)  Contamination (%)  MAG size (bp)  Contigs  Largest contig (bp)
group_003   99.66             0.00               2399243        48       235868
group_009   100.00            0.00               3316968        309      145584
group_020   99.43             3.61               2829889        205      131228
group_027   96.58             4.01               4592524        224      166430
group_029   91.67             0.65               3909026        304      94718
group_059   95.73             2.37               1651387        207      36337
group_036   95.76             0.00               1947777        141      138361
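To make the marker-gene idea concrete, here is a minimal sketch in Python. It is deliberately simplified: real tools such as CheckM use lineage-specific, collocated marker sets and weighting, which this does not attempt. The function name `estimate_quality` and the marker IDs are hypothetical, made up only for illustration.

```python
from collections import Counter


def estimate_quality(expected_markers, observed_hits):
    """Toy completeness/contamination estimate from single-copy markers.

    expected_markers: marker IDs expected once per genome (hypothetical IDs).
    observed_hits: marker IDs found in the MAG, one entry per gene copy.
    This is NOT the algorithm of any real tool, just the basic idea.
    """
    expected = set(expected_markers)
    # Count how many copies of each expected marker were found.
    counts = Counter(hit for hit in observed_hits if hit in expected)
    # Completeness: share of expected markers seen at least once.
    present = sum(1 for marker in expected if counts[marker] >= 1)
    # Contamination: extra copies of markers that should be single-copy.
    extra = sum(counts[marker] - 1 for marker in expected if counts[marker] > 1)
    completeness = 100.0 * present / len(expected)
    contamination = 100.0 * extra / len(expected)
    return completeness, contamination


# Hypothetical example: 4 expected markers, one missing, one duplicated.
markers = ["m1", "m2", "m3", "m4"]
hits = ["m1", "m2", "m2", "m3"]
print(estimate_quality(markers, hits))
```

Note that neither the MAG size nor the number of contigs enters the calculation, which is why the values in the table above do not track them.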

Second, I think you might be assigning too much significance to whether you have a new species or not. It is difficult to tell with certainty that something is a new species because, as you noted, the genomes available for comparison are themselves MAGs of varying completeness, so the resolution is limited. It may feel important because "how many people really discover a new species?" but the reality is that anyone who's analyzed a metagenome with 50+ MAGs has probably discovered a new species.

I think what should be most important for your purpose is that you have a seemingly 100% complete Verrucomicrobiota MAG that belongs to genus AV40. Any conclusion beyond that might be stretching it, and frankly I don't think it is very important to conclude anything beyond that. But there is no harm if you go find GTDB species representatives for your group and determine their ANI against your MAG.
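If you do run the comparison against GTDB species representatives, something like the sketch below could summarize the result. It assumes FastANI's default tab-separated output (query, reference, ANI, mapped fragments, total query fragments) and the commonly used ~95% ANI convention for species boundaries; treat that cutoff as an assumption, not a hard rule. The function name `best_hits` and the file names in the example are hypothetical.

```python
import csv
import io

# Assumption: ~95% ANI is the usual rule-of-thumb species boundary.
SPECIES_ANI_THRESHOLD = 95.0


def best_hits(fastani_output):
    """Return the highest-ANI reference for each query genome.

    fastani_output: an open text handle over FastANI's tab-separated
    output (query, reference, ANI, mapped fragments, total fragments).
    """
    best = {}
    for row in csv.reader(fastani_output, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        query, reference = row[0], row[1]
        ani, mapped, total = float(row[2]), int(row[3]), int(row[4])
        if query not in best or ani > best[query]["ani"]:
            best[query] = {
                "reference": reference,
                "ani": ani,
                # Fraction of query fragments that mapped to the reference.
                "aligned_fraction": mapped / total if total else 0.0,
                "same_species": ani >= SPECIES_ANI_THRESHOLD,
            }
    return best


# Hypothetical output lines for one MAG against two representatives.
example = (
    "bin_1.fa\tref_A.fa\t88.5\t600\t900\n"
    "bin_1.fa\tref_B.fa\t91.2\t700\t900\n"
)
for query, hit in best_hits(io.StringIO(example)).items():
    print(query, hit)
```

The input would come from a run along the lines of `fastANI -q bin_1.fa --rl reference_list.txt -o fastani_out.txt`, where `reference_list.txt` lists the representative genome paths. Keep in mind that FastANI does not report pairs whose ANI is well below about 80%, so a MAG with no rows at all simply has no close match in the set. And since the references are MAGs too, a low aligned fraction is a reason to read the ANI value cautiously.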