I want to compare my newly defined (environmental) bacterial species to a handful (<10) of related species from the same genus, to find genes or functions that are absent from those relatives.
Most of the tools and pipelines I've found are aimed at comparing strains within a species, or otherwise closely related strains: finding differences in genes/alleles/SNPs in shared regions, or identifying core and accessory genes within a species or genus. I haven't found any that simply find and highlight differences in functional capacity. I'm not interested in collinearity or evidence of genome expansion/contraction. The question should be simple: is there anything that differentiates this new species from the others in terms of functional capacity?
The bacterial genomes (assemblies of varying quality and completeness) have been uniformly annotated (PGAP and/or Bakta). I've used plain BLAST/Diamond to compare protein sequences among them and find genes/proteins of my species that do not occur in the others at all. For a more sophisticated approach, I've used PIRATE and OrthoFinder to create clusters at varying degrees of similarity within and across genomes.
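To make the BLAST/Diamond filtering step concrete, here is a minimal sketch of how "no hit in the other genomes" could be derived from tabular output. It assumes the standard 12-column tabular format (`qseqid`, `sseqid`, `pident`, ..., `evalue`, `bitscore`); the function name, the identity and e-value cutoffs, and the toy rows are all made up for illustration, and query coverage is ignored for simplicity.

```python
import csv
import io


def proteins_without_hits(query_ids, tabular_text, min_identity=30.0, max_evalue=1e-5):
    """Return query proteins with no hit passing the (assumed) thresholds.

    Expects 12-column tabular output: column 1 is the query ID,
    column 3 the percent identity, column 11 the e-value.
    """
    hit_queries = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        qseqid, pident, evalue = row[0], float(row[2]), float(row[10])
        if pident >= min_identity and evalue <= max_evalue:
            hit_queries.add(qseqid)
    # Proteins that never appear with a passing hit count as "unique" here.
    return [q for q in query_ids if q not in hit_queries]


# Toy, invented hits of my species' proteins against the related genomes.
toy_hits = (
    "prot1\trelA_001\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "prot2\trelB_017\t25.0\t90\t67\t1\t5\t94\t10\t99\t0.5\t30\n"
)
no_hit = proteins_without_hits(["prot1", "prot2", "prot3"], toy_hits)
print(no_hit)
```

Note that the result depends entirely on where the cutoffs are set, which is part of why sequence-level "uniqueness" does not translate directly into functional uniqueness.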
I now have lists of protein clusters that are unique or shared, with "exact" functional annotations (product names) for most of them and, to a lesser extent, GO/KEGG/Pfam or other identifiers. I've tried to make a statement based on the function of these "unique" genes, but it turns out their function isn't exactly unique. Just because a protein cluster is unique at >80% or >95% identity and its proteins are annotated as "dITP/XTP pyrophosphatase" doesn't mean the other genomes lack a protein/cluster with the same functional annotation. Likewise, if a GO term such as "GO:0016829" (describing a biological process, molecular function, or cellular component) can be assigned to a cluster, that doesn't mean another cluster with the same or a similar GO term doesn't exist in the other genomes.
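One way to sidestep the cluster-identity problem would be to compare the annotation labels themselves rather than the clusters. A minimal sketch, assuming each genome's product names (or GO terms, Pfam IDs, etc.) have been collected into a list; the genome names, the annotations, and the helper functions are hypothetical:

```python
def normalize(product):
    """Lowercase and collapse whitespace so trivially different labels match."""
    return " ".join(product.lower().split())


def functions_unique_to(target, annotations):
    """Return labels found in `target` but in none of the other genomes.

    `annotations` maps genome name -> iterable of annotation strings.
    """
    target_funcs = {normalize(p) for p in annotations[target]}
    other_funcs = set()
    for genome, products in annotations.items():
        if genome != target:
            other_funcs.update(normalize(p) for p in products)
    return target_funcs - other_funcs


# Toy, made-up annotations purely for illustration.
annotations = {
    "new_species": [
        "dITP/XTP pyrophosphatase",
        "Hypothetical lyase X",
        "DNA gyrase subunit A",
    ],
    "relative_1": ["dITP/XTP  pyrophosphatase", "DNA gyrase subunit A"],
    "relative_2": ["DNA gyrase subunit A"],
}
unique = functions_unique_to("new_species", annotations)
print(sorted(unique))
```

Here the pyrophosphatase drops out because a relative carries the same label, even if its protein falls in a different sequence cluster. Exact string matching is fragile for free-text product names, so controlled identifiers are probably the safer input.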
Another possibility I haven't finished exploring yet is to use the KEGG annotation (KO assignments and pathways): look for obvious differences/overlaps in KO numbers and names (in case names and KO numbers are not equally unique), then look at how completely each pathway is covered in each genome.
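A rough sketch of that KO comparison, assuming the KO assignments per genome are available as sets and that a pathway-to-KO mapping has been obtained separately. The KO IDs are used as plain strings, and the pathway names and memberships below are invented rather than real KEGG content:

```python
def ko_differences(target, ko_sets):
    """Compare the target genome's KO set against all other genomes."""
    other_sets = [kos for genome, kos in ko_sets.items() if genome != target]
    if not other_sets:
        raise ValueError("need at least one genome to compare against")
    in_any_other = set().union(*other_sets)
    in_all_others = set.intersection(*other_sets)
    return {
        # KOs present in the target and in none of the relatives.
        "only_in_target": ko_sets[target] - in_any_other,
        # KOs shared by every relative but absent from the target.
        "missing_from_target": in_all_others - ko_sets[target],
    }


def pathway_coverage(kos, pathways):
    """Fraction of each pathway's KOs that are present in one genome."""
    return {name: len(kos & members) / len(members)
            for name, members in pathways.items()}


# Toy data: invented KO assignments and pathway memberships.
ko_sets = {
    "new_species": {"K00001", "K00002", "K00005"},
    "relative_1": {"K00001", "K00003", "K00004"},
    "relative_2": {"K00001", "K00003"},
}
pathways = {
    "pathway_A": {"K00001", "K00002"},
    "pathway_B": {"K00003", "K00004"},
}

diff = ko_differences("new_species", ko_sets)
coverage = {g: pathway_coverage(kos, pathways) for g, kos in ko_sets.items()}
print(diff)
print(coverage)
```

Because assemblies differ in completeness, a KO that is "missing" from one genome may simply be unassembled, so the coverage fractions are probably more robust than single-KO presence/absence.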
Any tools or pipelines for identifying functional differences between entire bacterial genomes you can think of? Other approaches or methods that would make sense for such questions?
Thanks!