My experiment is the following:
- Temperature 26C vs. Temperature 30C
- Treatment Saline vs. Treatment BMC
- Timepoints 0, 2, and 3.
My intention is to create GLM model to look at differential abundance between groups:
- 26C vs. 30C for Saline treatment
- 26C vs. 30C for BMC treatment
- Saline vs. BMC at 26C
- Saline vs. BMC at 30C
In doing so, I would like to include all the samples possible when calculating the dispersion which is why I'm using a GLM instead of a Fisher's Exact Test for subsets of samples. I would also like to incorporate the ordered time information.
I'm trying to understand why my GLM model is failing. I tried using all of my data including the baseline T0 which was an issue with my design matrix because it was not of full rank. I understand why the baseline was an issue as it didn't add any information for any of the other categories. However, I'm not sure why this occurs for
Here's my data, code, and experiment design. I'm using Python as my main environment with an R package called
edgeR in the backend using an interface called
rpy2 if this helps having this info.
from rpy2 import robjects as ro from rpy2 import rinterface as ri from rpy2.robjects.packages import importr from rpy2.robjects import pandas2ri R = ro.r NULL = ri.NULL random_state=0 ro.r('set.seed')(random_state) # Package edgeR = importr("edgeR") limma = importr("limma") def pandas_to_rpy2(df, rpy2_version_major=None): if rpy2_version_major is None: from rpy2 import __version__ as rpy2_version rpy2_version_major = int(rpy2_version.split(".")) if rpy2_version_major == 2: # v2.x return pandas2ri.py2ri(df) if rpy2_version_major == 3: # v3.x from rpy2.robjects.conversion import localconverter with localconverter(ro.default_converter + pandas2ri.converter): return ro.conversion.py2rpy(df) def build_glm_model(X, design_matrix, random_state=0): components = X.columns # Run edgeR r_X = pandas_to_rpy2(X.T) r_components = ro.vectors.StrVector(components) r_design = pandas_to_rpy2(design_matrix) # Give DGEList the gene expression table with groups model = edgeR.DGEList( counts= r_X, genes = r_components, # group = r_y, ) # Normalize the expression values # if kws["calculate_normalization_factors"]: model = edgeR.calcNormFactors(model) # Calculate dispersion model = edgeR.estimateGLMCommonDisp(model, r_design) # if kws["estimate_glm_trended_dispersion"]: model = edgeR.estimateGLMTrendedDisp(model, r_design) # Estimate tagwise dispersion # if kws["estimate_glm_tagwise_dispersion"]: model = edgeR.estimateGLMTagwiseDisp(model, r_design) # Fit fit = edgeR.glmFit(model, r_design) return (model, fit) # Counts X = pd.read_csv("https://pastebin.com/raw/J7kmL8Ly", sep="\t", index_col=0) # Metadata df_metadata = pd.read_csv("https://pastebin.com/raw/PANaC3r5", sep="\t", index_col=0) design_matrix = pd.concat([ pd.get_dummies(df_metadata["treatment"].fillna("Baseline"), prefix="Treatment"), pd.get_dummies(df_metadata["temperature"], prefix="Temperature"), df_metadata["collection_time_numeric"].to_frame("t"), ], axis=1).astype(int) # Remove the baseline samples design_matrix = design_matrix.drop("Treatment_Baseline", axis=1).query("t > 0") # Create the GLM model model, fit = build_glm_model(X.loc[design_matrix.index], design_matrix) # RRuntimeError: Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, : # Design matrix not of full rank. The following coefficients not estimable: # Temperature_30C