Assigning taxonomic features in CAT: incorrect species annotation?
21 days ago
DriesB ▴ 110

Hello Biostars,

I am trying to use CAT (the Contig Assignment Tool: publication & repository) to taxonomically classify ORFs. I want to use the same taxonomic rank when classifying.

I use the NCBI database.


CAT gives me results where the taxonomy and taxid-lineage columns look as follows:

ORF0; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; Methanomassiliicoccaceae; Methanomassiliicoccus; Methanomassiliicoccus luminyensis; 1,131567,2157,28890,2283796,183967,1235850,1577788,1080709,1080712;

(No comment here. Fully assigned open reading frame.)

ORF1; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; no support; no support; no support; 1,131567,2157,28890,2283796,183967,1235850

(No comment here. This ORF can only be assigned up to order Methanomassiliicoccales.)

ORF2; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon; 1,131567,2157,28890,2283796,183967,1235850,1577790,1906667
ORF3; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon; 1,131567,2157,28890,2283796,183967,1235850,1577790,1906667

Taxonomy for these two open reading frames can be resolved up to order Methanomassiliicoccales. But why is a species 'Methanomassiliicoccales archaeon' assigned? I think I cannot assume that these two ORFs are from the same species. My intuition would therefore be to remove the species annotation. Do you agree?

ORF4; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon PtaB.Bin215; 1,131567,2157,28890,2283796,183967,1235850,1577790,1811728
ORF5; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon RumEn M1; 1,131567,2157,28890,2283796,183967,1235850,1577790,1713724

Here the species annotation adds bin information. The question remains, however, whether 'Methanomassiliicoccales archaeon PtaB.Bin215' and 'Methanomassiliicoccales archaeon RumEn M1' can be assumed to be different species. Also, if we assume that ORF4 and ORF5 are from different species, shouldn't their annotation then be copied to genus and family rank?

If we do not assume the bin information to be valuable, I think the species annotation should be removed here as well. Do you agree?


What is your view on my questions (marked bold)?

metagenomics taxonomy CAT • 468 views

I have already raised this issue at the CAT repository, but I think this is in fact a bioinformatics question rather than a bug in the software, as the output seems to be the result of the structure of the NCBI Taxonomy database.


P.S. It has been some time since my last post on the site, but I am very happy to see that the website is doing very well!

21 days ago

If we do not assume the bin information to be valuable, I think the species annotation should be removed here as well. Do you agree?

The bin information is actually valuable if the bin is good enough to be considered by GTDB as the representative genome of a novel Methanomassiliicoccales lineage. See for example RumEn M1 in GTDB: https://gtdb.ecogenomic.org/searches?s=al&q=GCA_001421185.1

If you want to solve this problem, you should pick GTDB as the reference database. This will solve the problem for ORFs assigned to bins/MAGs that remain unclassified in NCBI while being fully classified in GTDB.


Thank you for the resolute reply!

20 days ago

As you said yourself, this is an issue with the NCBI Taxonomy database.

The Methanomassiliicoccales archaeon taxonomy ID is 1906667:

$ echo 1906667 | taxonkit lineage | taxonkit reformat
1906667 cellular organisms;Archaea;Candidatus Thermoplasmatota;Thermoplasmata;Methanomassiliicoccales;unclassified Methanomassiliicoccales;Methanomassiliicoccales archaeon       Archaea;Candidatus Thermoplasmatota;Thermoplasmata;Methanomassiliicoccales;;;Methanomassiliicoccales archaeon

As you can see from the ;;; this NCBI Taxonomy entry has no genus or family assigned, thus the NA in your output. That was probably done by the original submitter.
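If you want to check this for many taxids at once, here is a minimal Python sketch that reports which ranks are empty in a reformatted lineage string. It assumes the seven-field layout shown in the output above; the rank labels and the function name `empty_ranks` are my own, not part of taxonkit.

```python
# Rank labels are assumptions chosen to match the seven semicolon-separated
# fields in the reformatted lineage shown above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def empty_ranks(reformatted_lineage):
    """Return the labels of the ranks that are empty in a 7-field lineage string."""
    fields = reformatted_lineage.split(";")
    if len(fields) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} fields, got {len(fields)}")
    return [rank for rank, name in zip(RANKS, fields) if not name.strip()]


# Reformatted lineage for taxid 1906667, copied from the output above.
lineage = (
    "Archaea;Candidatus Thermoplasmatota;Thermoplasmata;"
    "Methanomassiliicoccales;;;Methanomassiliicoccales archaeon"
)
print(empty_ranks(lineage))  # ['family', 'genus']
```

Any taxid that reports empty family and genus here will show up with NA in those columns of your CAT output.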

Methanomassiliicoccales archaeon PtaB.Bin215 is a different 'species' with a different taxonomy ID, but the same deal with the entry:

$ echo 1811728 | taxonkit lineage | taxonkit reformat
1811728 cellular organisms;Archaea;Candidatus Thermoplasmatota;Thermoplasmata;Methanomassiliicoccales;unclassified Methanomassiliicoccales;Methanomassiliicoccales archaeon PtaB.Bin215   Archaea;Candidatus Thermoplasmatota;Thermoplasmata;Methanomassiliicoccales;;;Methanomassiliicoccales archaeon PtaB.Bin215

You cannot copy those to genus/family rank because Methanomassiliicoccales is the order, which is higher than family, genus and species. It only looks like a species-level assignment because of the way the NCBI Taxonomy database works. In other words, your ORFs hit two 'species' about which the original submitters only know the order. You could try something like GTDB-Tk to see whether you get different hits for those bins.
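If you decide to treat such hits as order-level assignments, one option is to walk down the ranks and stop at the first one that is not named. Below is a rough Python sketch of that idea. It assumes rows laid out as in the question (ORF id, seven rank names, then the taxid lineage, separated by semicolons) and treats 'NA' and 'no support' as placeholders; the function name `deepest_contiguous_rank` is hypothetical. Note that this throws away the bin name, which may be worth keeping if the bin is a good genome.

```python
# Assumed row layout (taken from the question, not from CAT's documentation):
# ORF id; 7 rank names; taxid lineage
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
PLACEHOLDERS = {"", "NA", "no support"}


def deepest_contiguous_rank(row):
    """Return (orf_id, rank, name) for the deepest rank reached before any gap."""
    fields = [field.strip() for field in row.split(";")]
    if len(fields) < 1 + len(RANKS):
        raise ValueError("row has too few fields")
    orf_id = fields[0]
    names = fields[1:1 + len(RANKS)]
    deepest = (None, None)
    for rank, name in zip(RANKS, names):
        if name in PLACEHOLDERS:
            break  # first unnamed rank: ignore everything below it
        deepest = (rank, name)
    return (orf_id, *deepest)


rows = [
    "ORF1; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; "
    "no support; no support; no support; 1,131567,2157,28890,2283796,183967,1235850",
    "ORF4; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; "
    "NA; NA; Methanomassiliicoccales archaeon PtaB.Bin215; "
    "1,131567,2157,28890,2283796,183967,1235850,1577790,1811728",
]
for row in rows:
    print(deepest_contiguous_rank(row))
```

With this rule ORF1 and ORF4 both end up at order Methanomassiliicoccales, so they are compared at the same rank.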


Thank you for replying. However, isn't this mostly rephrasing the question, instead of answering it?


Also, if we assume that ORF4 and ORF5 are from different species, shouldn't their annotation then be copied to genus and family rank?

With this quote from my question, I do not mean copying down from order-level (I think you have understood it as that), but I mean that the species-level assignment should be copied to genus- and family-level, because (my assumption:) taxonomic knowledge of species always includes knowledge of the levels above.
