Hello Biostars,
I am trying to use CAT (the Contig Annotation Tool: publication & repository) to taxonomically classify ORFs. I want to classify all ORFs at the same taxonomic rank.
I use the NCBI database.
CAT gives me results where the taxonomy and taxid-lineage columns look as follows:
ORF0; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; Methanomassiliicoccaceae; Methanomassiliicoccus; Methanomassiliicoccus luminyensis; 1,131567,2157,28890,2283796,183967,1235850,1577788,1080709,1080712;
(No comment here. Fully assigned open reading frame.)
ORF1; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; no support; no support; no support; 1,131567,2157,28890,2283796,183967,1235850
(No comment here. This ORF can only be assigned up to order Methanomassiliicoccales.)
ORF2; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon; 1,131567,2157,28890,2283796,183967,1235850,1577790,1906667
ORF3; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon; 1,131567,2157,28890,2283796,183967,1235850,1577790,1906667
The taxonomy of these two open reading frames can be resolved up to the order Methanomassiliicoccales. But why is a species, 'Methanomassiliicoccales archaeon', assigned? I do not think I can assume that these two ORFs come from the same species. My intuition would therefore be to remove the species annotation. Do you agree?
ORF4; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon PtaB.Bin215; 1,131567,2157,28890,2283796,183967,1235850,1577790,1811728
ORF5; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon RumEn M1; 1,131567,2157,28890,2283796,183967,1235850,1577790,1713724
Here the species annotation adds bin information. The question remains, however, whether 'Methanomassiliicoccales archaeon PtaB.Bin215' and 'Methanomassiliicoccales archaeon RumEn M1' can be assumed to be different species. Also, if we assume that ORF4 and ORF5 are from different species, shouldn't their annotation then be copied to the genus and family ranks?
If we do not assume the bin information to be valuable, I think the species annotation should be removed here as well. Do you agree?
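To make the "remove the species annotation" option concrete, here is a minimal Python sketch of the post-processing I have in mind. It assumes the semicolon-separated layout shown in the examples above (ORF ID, seven rank names, then the taxid lineage), which may not match CAT's actual output file format. The rank labels in `RANKS` and the helper names `parse_line`, `truncate_at_first_gap`, and `deepest_resolved_rank` are made up for illustration and are not part of CAT.

```python
# Assumed rank order for the seven name columns (an assumption, not taken from CAT).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# Values that mark a rank as unresolved in the example lines above.
UNRESOLVED = {"NA", "no support"}


def parse_line(line):
    """Split one example line into (orf_id, list of seven rank names)."""
    fields = [f.strip() for f in line.strip().rstrip(";").split(";")]
    orf_id = fields[0]
    names = fields[1:1 + len(RANKS)]
    return orf_id, names


def truncate_at_first_gap(names):
    """Blank every named rank that follows the first unresolved rank.

    Existing 'NA' / 'no support' markers are kept as they are; any name
    that appears after the first gap is replaced by 'NA'.
    """
    result = []
    gap_seen = False
    for name in names:
        if name in UNRESOLVED:
            gap_seen = True
            result.append(name)
        elif gap_seen:
            result.append("NA")
        else:
            result.append(name)
    return result


def deepest_resolved_rank(names):
    """Return (rank, name) of the last rank before the first gap, or None."""
    deepest = None
    for rank, name in zip(RANKS, names):
        if name in UNRESOLVED:
            break
        deepest = (rank, name)
    return deepest


if __name__ == "__main__":
    example = (
        "ORF2; Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; "
        "NA; NA; Methanomassiliicoccales archaeon; "
        "1,131567,2157,28890,2283796,183967,1235850,1577790,1906667"
    )
    orf_id, names = parse_line(example)
    print(orf_id, truncate_at_first_gap(names), deepest_resolved_rank(names))
```

With this approach, ORF2 to ORF5 would all end up classified at order level, the same as ORF1, while ORF0 keeps its full lineage. Whether discarding the species-level names is the right call is exactly what I am asking about.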
What is your view on my questions (marked bold)?
I have already raised this issue at the CAT repository, but I think this is in fact a bioinformatics question rather than a bug in the software, since the output seems to result from the structure of the NCBI Taxonomy database.
P.S. It has been some time since my last post on the site, but I am very happy to see that the website is doing very well!