Question

Understanding why `Design matrix not of full rank. The following coefficients not estimable` during GLM

0

Entering edit mode

3.0 years ago

O.rka ▴ 710

My experiment is the following:

Temperature 26C vs. Temperature 30C
Treatment Saline vs. Treatment BMC
Timepoints 0, 2, and 3.

My intention is to create GLM model to look at differential abundance between groups:

26C vs. 30C for Saline treatment
26C vs. 30C for BMC treatment
Saline vs. BMC at 26C
Saline vs. BMC at 30C

In doing so, I would like to include all the samples possible when calculating the dispersion which is why I'm using a GLM instead of a Fisher's Exact Test for subsets of samples. I would also like to incorporate the ordered time information.

I'm trying to understand why my GLM model is failing. I tried using all of my data including the baseline T0 which was an issue with my design matrix because it was not of full rank. I understand why the baseline was an issue as it didn't add any information for any of the other categories. However, I'm not sure why this occurs for Temperature_30C.

Here's my data, code, and experiment design. I'm using Python as my main environment with an R package called edgeR in the backend using an interface called rpy2 if this helps having this info.

from rpy2 import robjects as ro
from rpy2 import rinterface as ri
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
R = ro.r
NULL = ri.NULL

random_state=0
ro.r('set.seed')(random_state)

# Package
edgeR = importr("edgeR")
limma = importr("limma")


def pandas_to_rpy2(df, rpy2_version_major=None):
    if rpy2_version_major is None:
        from rpy2 import __version__ as rpy2_version
        rpy2_version_major = int(rpy2_version.split(".")[0])
    if rpy2_version_major == 2: # v2.x
        return pandas2ri.py2ri(df)
    if rpy2_version_major == 3: # v3.x
        from rpy2.robjects.conversion import localconverter

        with localconverter(ro.default_converter + pandas2ri.converter):
            return ro.conversion.py2rpy(df)

def build_glm_model(X, design_matrix, random_state=0):

    components = X.columns

    # Run edgeR
    r_X = pandas_to_rpy2(X.T)
    r_components = ro.vectors.StrVector(components)
    r_design = pandas_to_rpy2(design_matrix)

    # Give DGEList the gene expression table with groups
    model = edgeR.DGEList(
        counts= r_X, 
        genes = r_components, 
#         group = r_y,
    )

    # Normalize the expression values
#     if kws["calculate_normalization_factors"]:
    model = edgeR.calcNormFactors(model)

    # Calculate dispersion
    model = edgeR.estimateGLMCommonDisp(model, r_design)

#     if kws["estimate_glm_trended_dispersion"]:
    model = edgeR.estimateGLMTrendedDisp(model, r_design)

    # Estimate tagwise dispersion
#     if kws["estimate_glm_tagwise_dispersion"]:
    model = edgeR.estimateGLMTagwiseDisp(model, r_design)

    # Fit
    fit = edgeR.glmFit(model, r_design)

    return (model, fit)

# Counts
X = pd.read_csv("https://pastebin.com/raw/J7kmL8Ly", sep="\t", index_col=0)

# Metadata
df_metadata = pd.read_csv("https://pastebin.com/raw/PANaC3r5", sep="\t", index_col=0)
design_matrix = pd.concat([
    pd.get_dummies(df_metadata["treatment"].fillna("Baseline"), prefix="Treatment"),
    pd.get_dummies(df_metadata["temperature"], prefix="Temperature"),
    df_metadata["collection_time_numeric"].to_frame("t"),
], axis=1).astype(int)

# Remove the baseline samples
design_matrix = design_matrix.drop("Treatment_Baseline", axis=1).query("t > 0")

# Create the GLM model 
model, fit = build_glm_model(X.loc[design_matrix.index], design_matrix)
# RRuntimeError: Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  : 
#   Design matrix not of full rank.  The following coefficients not estimable:
#  Temperature_30C

edgeR design-matrix differential glm offtopic • 1.5k views

ADD COMMENT • link updated 3.0 years ago by Kevin Blighe 87k • written 3.0 years ago by O.rka ▴ 710

1

Entering edit mode

Also posted on Cross Validated https://stats.stackexchange.com/questions/521917/

ADD REPLY • link 3.0 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Answered by Gordon here: How to structure my edgeR model to incorporate a temporal variable and 2 categorical variables?

ADD REPLY • link 3.0 years ago by Kevin Blighe 87k