I have a large presence/absence matrix of genes from different strains of S. aureus that I obtained through microbiome shotgun sequencing. I conducted a PCA and observed differences between the strains living in the conditions (healthy and diseased). Now, I would like to find out which genes drive this variation. My presence/absence matrix comes with genes in columns and observations/samples in rows. I have a grouping variable as class factor, which specifies which samples belong to which condition (i. e. healthy or diseased).
gene1 gene2 gene3 sample1 1 0 0 sample2 1 1 0 sample3 0 1 0
in total I have >3500 genes and 12 strains.
 healthy healthy diseased
How can I find out, which genes are (preferentially) present in the strains on diseased sites? Through a loop, I already tried a glm
glm(formula = Gene_x ~ condition, family = binomial(link = "logit"), data = gene.matrix) and received a list with an entry for every single gene that holds the results of the glm call. Is this appropriate? Can I just filter out those genes with a pvalue <0.05? And If yes, how can I do that? Would I need to adjust the pvalue for multiple testing?