Question

Search group specific genes in presence/absence matrix of pangenome

0

Entering edit mode

2.2 years ago

plicht ▴ 20

I have a large presence/absence matrix of genes from different strains of S. aureus that I obtained through microbiome shotgun sequencing. I conducted a PCA and observed differences between the strains living in the conditions (healthy and diseased). Now, I would like to find out which genes drive this variation. My presence/absence matrix comes with genes in columns and observations/samples in rows. I have a grouping variable as class factor, which specifies which samples belong to which condition (i. e. healthy or diseased).

> head(gene.matrix)

                gene1     gene2      gene3
sample1           1        0             0
sample2           1        1             0
sample3           0        1             0

in total I have >3500 genes and 12 strains.

> grouping.variable

[1] healthy healthy diseased

How can I find out, which genes are (preferentially) present in the strains on diseased sites? Through a loop, I already tried a glm Call: glm(formula = Gene_x ~ condition, family = binomial(link = "logit"), data = gene.matrix) and received a list with an entry for every single gene that holds the results of the glm call. Is this appropriate? Can I just filter out those genes with a pvalue <0.05? And If yes, how can I do that? Would I need to adjust the pvalue for multiple testing?

R statistics binary microbiome pangenome • 438 views

ADD COMMENT • link 2.2 years ago by plicht ▴ 20