Question

Design matrix with uneven patient representation across stages

5 months ago

Lucas • 0

My biological question of interest revolves around the inference of differentially expressed genes (DEGs) when comparing different stages of breast cancer. The dataset consists of 67 samples derived from biopsies of 25 patients. For each patient, multiple measurements are available, corresponding to various stages of disease progression (from normal tissue to early neoplasia, ductal carcinoma in situ, and invasive ductal carcinoma).

I am primarily interested in identifying differences between two conditions at a time (e.g., normal vs. early neoplasia), but I am currently facing challenges in constructing an appropriate design matrix. I understand that I need to account for patient identity to properly identify DEGs; however, patients are represented unevenly across conditions. This imbalance generates an unnecessarily large number of columns in the design matrix, leading to a number of variables that is roughly equal to the number of observations.

design <- model.matrix(~ disease_state + patient_number, data = attributes)

# This creates columns for patients represented by only one sample
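To make the column count concrete, here is a minimal sketch on made-up metadata (the patient IDs and stage assignments below are hypothetical, not the real dataset). It fits the same additive formula and compares the number of columns with the number of samples:

```r
# Hypothetical toy metadata: 6 patients, 9 samples, uneven coverage of stages.
attributes <- data.frame(
  patient_number = factor(c("P1", "P1", "P1", "P2", "P2", "P3", "P4", "P5", "P6")),
  disease_state  = factor(c("normal", "early", "invasive",
                            "normal", "early",
                            "normal", "early", "invasive", "normal"),
                          levels = c("normal", "early", "invasive"))
)

# Same additive model as in the question: stage effect plus one column per patient.
design <- model.matrix(~ disease_state + patient_number, data = attributes)

n_samples <- nrow(design)                  # number of observations
n_columns <- ncol(design)                  # intercept + 2 stages + 5 patients
resid_df  <- n_samples - qr(design)$rank   # degrees of freedom left for the variance

# Patients with a single sample each consume a column of their own.
n_singletons <- sum(table(attributes$patient_number) == 1)

cat("samples:", n_samples, " columns:", n_columns,
    " residual df:", resid_df, " single-sample patients:", n_singletons, "\n")
```

In this toy case almost every degree of freedom goes to the patient columns, which is the same pattern described above for the real data.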

Since some patients are represented by only one sample (and thus one condition), while others have samples across up to three conditions (e.g., normal, early, and invasive), my proposed solution is to include only patients who are represented by more than one sample in the design matrix. However, I am unsure whether this approach might introduce bias or other issues.

dup <- factor(as.numeric(duplicated(attributes$patient_number) | 
                         duplicated(attributes$patient_number, fromLast = TRUE)) * 
              as.numeric(attributes$patient_number))

design <- model.matrix(~ disease_state + dup, data = attributes)

# Proposed method to account for multiple samples from the same patient (as the disease progresses)
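For readers trying to follow what the `dup` factor does, here is a more explicit sketch of the same idea on hypothetical toy metadata (the IDs and the level name "single" are my own assumptions). Patients with more than one sample keep their own level, and all single-sample patients are pooled into one reference level. Note that the one-liner above relies on `patient_number` being numeric or a factor; `as.numeric()` on character IDs such as "P1" would return `NA`. This only restates the proposal; it does not settle whether pooling the single-sample patients is statistically sound.

```r
# Hypothetical toy metadata: 6 patients, 9 samples, uneven coverage of stages.
attributes <- data.frame(
  patient_number = factor(c("P1", "P1", "P1", "P2", "P2", "P3", "P4", "P5", "P6")),
  disease_state  = factor(c("normal", "early", "invasive",
                            "normal", "early",
                            "normal", "early", "invasive", "normal"),
                          levels = c("normal", "early", "invasive"))
)

# Patients that contribute more than one sample.
n_per_patient <- table(attributes$patient_number)
multi_sample  <- names(n_per_patient)[n_per_patient > 1]

# Keep the patient ID for multi-sample patients; pool everyone else as "single".
dup <- ifelse(attributes$patient_number %in% multi_sample,
              as.character(attributes$patient_number),
              "single")
dup <- relevel(factor(dup), ref = "single")

# Stage effect plus one column per multi-sample patient only.
design <- model.matrix(~ disease_state + dup, data = attributes)
print(colnames(design))
```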

Any feedback is greatly appreciated!

RNA-seq design-matrix limma • 861 views

As far as I know, there is no random effect in limma/edgeR that would account for the patient effect the way you propose in the first chunk of code. Are you unsatisfied with the results from a design matrix that ignores the patient effect (design <- model.matrix(~ disease_state, data = attributes))?

Did you make an MDS plot to check whether there is a patient effect that needs to be removed?

ADD REPLY • link 5 months ago by SamGG ▴ 150

Moving this answer to a comment since it seems to contradict the answer provided by Gordon Smyth (author of edgeR).

ADD REPLY • link 5 months ago by GenoMax 154k

Thanks for the answer!

Yes to both. I am trying to remove the patient effect because the samples clearly cluster by patient number in a PCA plot. If I ignored the patient number, I would miss many of the DEGs. However, in the comparison between "ductal carcinoma in situ" and "invasive ductal carcinoma", for example, the 22 available samples would give a design matrix with 19 patient columns (only 6 samples represent separate conditions from the same patient). That would reduce the power of the model, which is why I proposed adding design-matrix columns only for individuals represented by multiple samples.

ADD REPLY • link 5 months ago by Lucas • 0

score 3 · Accepted Answer · 2025-05-13

5 months ago

Gordon Smyth ★ 8.4k

Uneven patient representation is handled in limma/edgeR by treating patient as a random effect. For RNA-seq, you would use

library(edgeR)
design <- model.matrix(~ disease_state)
fit <- voomLmFit(y, design, block=patient_number)

etc, where y is the DGEList or count matrix.
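As a rough sketch of what the "etc" might look like for a single pairwise comparison, the example below runs the blocked fit on simulated counts and then tests one contrast. Everything about the data is an assumption for illustration (the patient IDs, the stage labels, the negative binomial counts, the contrast name `early_vs_normal`), and the filtering and normalization steps are the usual edgeR defaults rather than anything prescribed in this thread. It uses a no-intercept design so that each stage gets its own column, which makes pairwise contrasts easy to write.

```r
library(edgeR)  # also attaches limma

set.seed(1)

# Hypothetical metadata: 8 patients, 14 samples, uneven coverage of stages.
patient_number <- factor(c("P1", "P1", "P1", "P2", "P2", "P3", "P3",
                           "P4", "P4", "P5", "P6", "P7", "P8", "P8"))
disease_state  <- factor(c("normal", "early", "invasive", "normal", "early",
                           "early", "invasive", "normal", "invasive",
                           "normal", "early", "invasive", "normal", "early"),
                         levels = c("normal", "early", "invasive"))

# Simulated counts (no real signal): 200 genes x 14 samples.
counts <- matrix(rnbinom(200 * 14, mu = 50, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:14)))

# Standard edgeR preprocessing: filter low counts, then TMM normalization.
y <- DGEList(counts)
keep <- filterByExpr(y, group = disease_state)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

# One column per stage, so contrasts can be written directly.
design <- model.matrix(~ 0 + disease_state)
colnames(design) <- levels(disease_state)

# Patient enters as a blocking (random) factor, not as design columns.
fit <- voomLmFit(y, design, block = patient_number)

# Test one pairwise comparison at a time.
contr <- makeContrasts(early_vs_normal = early - normal, levels = design)
fit2  <- eBayes(contrasts.fit(fit, contr))
tt    <- topTable(fit2, coef = "early_vs_normal")
print(head(tt))
```

Other comparisons (for example invasive versus early) would only need another entry in `makeContrasts()`; the patient blocking stays the same and no patient has to be dropped.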

Thank you!

ADD REPLY • link 5 months ago by Lucas • 0