My biological question of interest revolves around the inference of differentially expressed genes (DEGs) when comparing different stages of breast cancer. The dataset consists of 67 samples derived from biopsies of 25 patients. For each patient, multiple measurements are available, corresponding to various stages of disease progression (from normal tissue to early neoplasia, ductal carcinoma in situ, and invasive ductal carcinoma).
I am primarily interested in identifying differences between two conditions at a time (e.g., normal vs. early neoplasia), but I am currently facing challenges in constructing an appropriate design matrix. I understand that I need to account for patient identity to properly identify DEGs; however, patients are represented unevenly across conditions. This imbalance generates an unnecessarily large number of columns in the design matrix, leading to a number of variables that is roughly equal to the number of observations.
design <- model.matrix(~ disease_state + patient_number, data = attributes)
# This creates columns for patients represented by only one sample
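To make the imbalance concrete, here is a minimal sketch on a made-up sample sheet. The data frame `toy` and its values are hypothetical and are not the real 67-sample dataset; only the column names mirror the ones used above. It tabulates samples per patient and compares the number of design columns with the number of samples.

```r
# Toy sample sheet (hypothetical values, not the real data)
toy <- data.frame(
  patient_number = factor(c("P1", "P1", "P1", "P2", "P2", "P3", "P4", "P5")),
  disease_state  = factor(
    c("normal", "early", "invasive", "normal", "early", "normal", "early", "invasive"),
    levels = c("normal", "early", "invasive")
  )
)

# Samples per patient: patients with a count of 1 are the singletons
samples_per_patient <- table(toy$patient_number)
singletons <- names(samples_per_patient)[samples_per_patient == 1]

# Patient-by-condition layout, useful for spotting the imbalance
layout <- table(toy$patient_number, toy$disease_state)

# Full blocking design: intercept, disease-state columns,
# and one column per non-reference patient
design_full <- model.matrix(~ disease_state + patient_number, data = toy)

print(layout)
cat("singleton patients:", singletons, "\n")
cat("samples:", nrow(design_full), " columns:", ncol(design_full), "\n")
```

In this toy layout, three of the five patients are singletons, and the design has almost as many columns as there are samples.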
Since some patients are represented by only one sample (and thus one condition), while others have samples across up to three conditions (e.g., normal, early, and invasive), my proposed solution is to include only patients who are represented by more than one sample in the design matrix. However, I am unsure whether this approach might introduce bias or other issues.
dup <- factor(
  as.numeric(duplicated(attributes$patient_number) |
             duplicated(attributes$patient_number, fromLast = TRUE)) *
  as.numeric(attributes$patient_number)
)
design <- model.matrix(~ disease_state + dup, data = attributes)
# Proposed method to account for multiple samples from the same patient (as the disease progresses)
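For readers trying to follow what the `dup` factor does, the sketch below builds the same grouping with explicit labels on a small made-up data frame. `toy` and the label "singleton" are hypothetical names chosen for illustration. Patients with more than one sample keep their own level, and all single-sample patients are pooled into one reference level. This only illustrates the construction; it does not settle whether pooling is statistically appropriate.

```r
# Toy sample sheet (hypothetical values, not the real data)
toy <- data.frame(
  patient_number = factor(c("P1", "P1", "P2", "P3", "P3", "P4")),
  disease_state  = factor(c("normal", "early", "normal", "normal", "early", "early"),
                          levels = c("normal", "early"))
)

# Number of samples contributed by the patient on each row
counts    <- table(toy$patient_number)
n_per_row <- as.integer(counts[as.character(toy$patient_number)])

# Keep the patient label for repeated patients, pool singletons under one label
dup <- ifelse(n_per_row > 1, as.character(toy$patient_number), "singleton")
dup <- relevel(factor(dup), ref = "singleton")

# Reduced design: patient columns only for patients with repeated samples
design_dup <- model.matrix(~ disease_state + dup, data = toy)

print(data.frame(toy, dup))
print(colnames(design_dup))
```

With the pooled level as the reference, the singletons share the intercept, so only the repeated patients (here P1 and P3) get their own columns.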
Any feedback is greatly appreciated!
As far as I know, limma/edgeR has no random effect that could account for the patient effect in the way you propose in your first chunk of code. Are you unsatisfied with the results from a design matrix that ignores the patient effect?
design <- model.matrix(~ disease_state, data = attributes)
Did you make an MDS plot to check whether there is a patient effect that should be removed?
Moving this answer to a comment, since it seems to contradict the answer provided by Gordon Smyth (author of edgeR).
Thanks for the answer!
Yes and yes. I am trying to remove the patient effect because I observed clear clustering of the samples by patient number in a PCA plot. If I did not account for patient number, I would miss many of the DEGs. However, in the comparison between "ductal carcinoma in situ" and "invasive ductal carcinoma", for example, there are 22 available samples, and I would get a design matrix with 19 patient columns (only 6 samples represent separate conditions from the same patient). This would reduce the power of the model, which is why I proposed adding columns to the design matrix only for individuals represented by multiple samples.
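The arithmetic behind that concern can be sketched with a made-up layout that matches the counts in this comparison: 22 samples from 19 patients, 3 of whom were sampled in both conditions. How the 16 single-sample patients are split between conditions is an assumption (8 and 8 here), and all object names are hypothetical. The sketch compares the residual degrees of freedom of the full patient-blocked design with those of the pooled-singleton design.

```r
# Hypothetical layout: 3 patients sampled in both conditions (6 samples)
paired <- data.frame(patient = rep(paste0("P", 1:3), each = 2),
                     state   = rep(c("DCIS", "IDC"), times = 3))
# Assumption: the 16 single-sample patients are split evenly between conditions
single <- data.frame(patient = paste0("P", 4:19),
                     state   = rep(c("DCIS", "IDC"), each = 8))
toy <- rbind(paired, single)
toy$patient <- factor(toy$patient)
toy$state   <- factor(toy$state, levels = c("DCIS", "IDC"))

# Pooled-singleton factor: repeated patients keep their label
counts    <- table(toy$patient)
n_per_row <- as.integer(counts[as.character(toy$patient)])
pooled    <- ifelse(n_per_row > 1, as.character(toy$patient), "singleton")
toy$dup   <- relevel(factor(pooled), ref = "singleton")

design_full <- model.matrix(~ state + patient, data = toy)
design_dup  <- model.matrix(~ state + dup, data = toy)

# Residual degrees of freedom = samples minus rank of the design
resid_df <- function(d) nrow(d) - qr(d)$rank

cat("full:", ncol(design_full), "columns,", resid_df(design_full), "residual df\n")
cat("dup :", ncol(design_dup), "columns,", resid_df(design_dup), "residual df\n")
```

Under these assumptions, the full design has 20 columns (intercept, condition, and 18 patient columns, with one patient absorbed into the intercept) and leaves 2 residual degrees of freedom. Each single-sample patient is fit exactly by its own column, so it adds none. The pooled design has 5 columns and 17 residual degrees of freedom, gained by treating the singletons as sharing one baseline, so the variation between those patients goes into the residual.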