I am currently trying to build a theoretical argument that a microbe's phenotype can affect gene expression in its host. I have 5 species of microbes, each with a different COG (Clusters of Orthologous Genes) profile. I've been using Python and a large database (DualSeqDB) of significant changes in gene expression to compare host gene expression against the COG profiles of my 5 species.
This is an example of the dataframe I'm working from:
Rows are species, and columns are counts of the COGs present in each species.
I was wondering if there is any way to compare the species and their COG profiles. My end goal is to draw parallels between how prevalent a certain phenotype is in a species and how strongly that species affects its host's gene expression.
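To make the question concrete, here is a minimal sketch of one possible comparison. It is not a definitive method. It assumes the COG counts can be converted to relative abundances, compares species pairwise with Bray-Curtis dissimilarity, and then rank-correlates the share of one COG category with a per-species measure of host effect. Everything named here is hypothetical: the species labels, the COG counts, the `host_de_genes` numbers (standing in for however many significant host expression changes are found per species), and the choice of category "U" as the phenotype of interest. It uses only the standard library; SciPy offers equivalents in `scipy.spatial.distance.pdist(..., metric="braycurtis")` and `scipy.stats.spearmanr`.

```python
from itertools import combinations

# Hypothetical COG counts (rows = species, columns = COG categories).
# All numbers are made up for illustration only.
cog_counts = {
    "species_A": {"C": 120, "E": 200, "G": 90, "U": 15},
    "species_B": {"C": 110, "E": 180, "G": 140, "U": 30},
    "species_C": {"C": 95, "E": 210, "G": 60, "U": 45},
    "species_D": {"C": 130, "E": 150, "G": 100, "U": 20},
    "species_E": {"C": 105, "E": 190, "G": 80, "U": 60},
}

# Hypothetical count of significantly changed host genes per species.
host_de_genes = {
    "species_A": 40, "species_B": 95, "species_C": 150,
    "species_D": 60, "species_E": 210,
}


def relative_abundance(counts):
    """Scale raw COG counts so each profile sums to 1."""
    total = sum(counts.values())
    return {cog: n / total for cog, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two profiles with the same keys."""
    num = sum(abs(p[k] - q[k]) for k in p)
    den = sum(p[k] + q[k] for k in p)
    return num / den


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


profiles = {sp: relative_abundance(c) for sp, c in cog_counts.items()}

# Pairwise dissimilarity between every pair of species profiles.
distances = {
    (a, b): bray_curtis(profiles[a], profiles[b])
    for a, b in combinations(profiles, 2)
}

# Does the share of one COG category track the size of the host response?
species = list(profiles)
u_fraction = [profiles[sp]["U"] for sp in species]
host_effect = [host_de_genes[sp] for sp in species]
rho = spearman(u_fraction, host_effect)

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
print("Spearman rho (category U vs. host effect):", round(rho, 3))
```

With only five species, any correlation from a sketch like this is exploratory at best, so it seems better suited to generating hypotheses than to supporting a statistical claim.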
If this seems infeasible, would a better approach be to BLAST the species and compare their genome sequences phylogenetically? (This would change the nature of my argument, so I'm hesitant.)
I'm not the best with statistics, so I'm sorry if there's a really obvious answer I'm not thinking of. If anyone knows of any resources I could use to read up on this more on my own, I would really appreciate that as well.