Question

Should matched samples (not paired) be included in the DESeq2 design model?

5 months ago

marieke • 0

Hi,

I'm working on several projects that require differential expression analysis, and I have a question about the DESeq2 design model for matched (not paired) samples.
I don't know if there is a standardized way of using these terms, but assuming that:

Matched data will be two different populations in which an attempt has been made to reduce the variables by matching for certain characteristics that might impact the response being studied but which aren't the focus of the study. For example, age, gender, smoking status, etc.
Paired data is two populations of numbers in which the same variable has been measured on the same population usually at two different times, or under two different conditions. For example, before and after treatment with a given drug.
(copied from https://community.jmp.com/t5/JMP-Wish-List/matched-versus-paired-data/idi-p/546968)

From the DESeq2 vignette, I understand that we should include paired samples in the design (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#can-i-use-deseq2-to-analyze-paired-samples). However, is this also the case for matched samples?
I'm specifically interested in age-matched control and disease samples from healthy/sick donors.

Thank you, Marieke

EDIT: side note on specific situation: adjacent tumor/healthy tissue from same individual
In the edgeR documentation, there is an example titled 'RNA-Seq of oral carcinomas vs matched normal tissue', where the word 'matched' is used for this situation and the patient is then added to the design model: design <- model.matrix(~Patient+Tissue)
So here, the answer would be 'yes, you should add matching samples to the design model'. However, this situation is quite specific and could fall somewhere between the definitions of matched and paired given above. edgeR defines paired samples as:

Paired samples occur whenever we compare two treatments and each independent subject in the experiment receives both treatments.
but it does not specify what it would consider matched samples.
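For illustration, here is what a design of the form ~Patient+Tissue expands to. This is a minimal base R sketch with a made-up layout of three patients (the patient labels and tissue levels are assumptions, not the data from the edgeR example):

```r
# Hypothetical layout: 3 patients, each contributing normal and tumour tissue
Patient <- factor(rep(c("P1", "P2", "P3"), each = 2))
Tissue  <- factor(rep(c("Normal", "Tumour"), times = 3),
                  levels = c("Normal", "Tumour"))

# One column per non-reference patient, plus one column for the tissue effect
design <- model.matrix(~ Patient + Tissue)
colnames(design)
# "(Intercept)" "PatientP2" "PatientP3" "TissueTumour"
```

The patient columns absorb baseline differences between individuals, so the TissueTumour column estimates the tumour vs normal difference within patients.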

edgeR limma DESeq2 • 775 views

score 0 · Answer 1 · 2025-06-19

5 months ago

i.sudbery 22k

You add an extra covariate to the design matrix when you expect that it will explain some of the variance in gene expression. In general, I think this is the case for matched samples.

Including additional explanatory covariates means you have more coefficients to fit in your model. The more coefficients you fit, the more samples you need, because each coefficient uses up a degree of freedom, and this reduces your power. However, a covariate that explains real variation in expression reduces the residual variance, and with it the uncertainty in the coefficient you are interested in, which increases power. Generally, your overall power will be increased if the covariate explains a substantial portion of the variance, and decreased if it doesn't.
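A toy illustration of this trade-off, using an ordinary linear model on one simulated gene rather than DESeq2 itself. Everything here (the age groups, effect sizes, and sample numbers) is an assumption chosen to make the point visible:

```r
set.seed(1)

# Hypothetical design: 4 age groups, 3 controls and 3 disease samples in each
age_group <- factor(rep(rep(c("30s", "40s", "50s", "60s"), each = 3), times = 2))
condition <- factor(rep(c("control", "disease"), each = 12))

# Simulated log-expression: strong age effect, modest disease effect, small noise
age_effect <- c(-3, -1, 1, 3)
y <- age_effect[as.integer(age_group)] +
  0.5 * (condition == "disease") +
  rnorm(24, sd = 0.3)

fit_without <- lm(y ~ condition)
fit_with    <- lm(y ~ age_group + condition)

# Residual degrees of freedom: the extra coefficients cost 3 df
c(without = df.residual(fit_without), with = df.residual(fit_with))

# Standard error of the disease coefficient under each model
se_without <- coef(summary(fit_without))["conditiondisease", "Std. Error"]
se_with    <- coef(summary(fit_with))["conditiondisease", "Std. Error"]
c(without = se_without, with = se_with)
```

Because the simulated age effect is large, the loss of three degrees of freedom is more than repaid by the smaller standard error. Setting age_effect to zeros would show the opposite case, where the covariate only costs degrees of freedom.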

One way to check would be to look at a PCA. Do the points cluster in one of the first few PCs according to the levels of your matching criteria?
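A rough sketch of that check, using base R's prcomp on a simulated log-scale expression matrix. The matrix, the "young"/"old" matching factor, and the size of the age shift are all invented for illustration:

```r
set.seed(2)

# Hypothetical sample annotation: 12 samples, matched on age group
age_group <- factor(rep(c("young", "old"), each = 6))
condition <- factor(rep(rep(c("control", "disease"), each = 3), times = 2))

# Simulated log-scale expression: 200 genes x 12 samples,
# with the first 50 genes shifted upwards in the "old" samples
n_genes <- 200
expr <- matrix(rnorm(n_genes * 12, sd = 0.5), nrow = n_genes)
age_shift <- c(rep(2, 50), rep(0, n_genes - 50))
expr <- expr + outer(age_shift, as.numeric(age_group == "old"))

# PCA with samples as rows
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:3], 2)

# Do the samples separate along PC1 by the matching factor?
pc_scores <- data.frame(pca$x[, 1:2], age_group, condition)
aggregate(PC1 ~ age_group, data = pc_scores, FUN = mean)
```

With a real DESeq2 object, the usual route would be plotPCA() on the output of vst(), passing the matching factor via intgroup.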

My intuition is that you shouldn't check whether things cluster in a PCA and then run tests on that coefficient looking for differences, because that would be post-hoc double dipping. But I think (and a statistician might correct me if I'm wrong) this is okay if you are fitting this coefficient in order to ignore it.

Ah yes, looking at the PCA makes sense, thanks!

I'm not sure I understand your last paragraph, though. By 'running tests', do you mean running DESeq with and without that coefficient and checking the difference in results? That would be double dipping, but, if I understand you correctly, not really an issue when the aim is to ignore the coefficient?

ADD REPLY • link 5 months ago by marieke • 0

No, that's not what I mean.

When you run DESeq, it fits values to each of the coefficients that you specify. You then compute p-values on some of those coefficients; this is what I mean by doing a test. If you were to do a PCA, see a separation by a factor, then add that factor to the design and compute a p-value for its coefficient, that would be double dipping. Computing p-values for your coefficient of interest (e.g. Cancer/Not Cancer) with and without the matching factor would also be dodgy practice. However, I think deciding whether or not to include the matching criteria based on a PCA is okay, because you don't compute a p-value on the coefficient fitted to the matching factor (e.g. in your example above, you don't ask whether each patient is significantly different from the others).
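In DESeq2 terms, that could look roughly like the sketch below: the matching factor goes into the design, but only the coefficient of interest is tested. The data are simulated with makeExampleDESeqDataSet, and the age_group column is a made-up matching factor attached purely for illustration (it has no real effect on these counts):

```r
library(DESeq2)

set.seed(3)

# Simulated counts: 500 genes, 12 samples, with a `condition` factor (A vs B)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# Hypothetical matching factor, balanced across the two conditions
dds$age_group <- factor(rep(c("young", "old"), times = 6))

# Matching factor first, coefficient of interest last
design(dds) <- ~ age_group + condition

dds <- DESeq(dds)
resultsNames(dds)

# Test only the condition effect; the age_group coefficient is fitted but never tested
res <- results(dds, contrast = c("condition", "B", "A"))
summary(res)
```

The age_group coefficient is estimated so that its variance is accounted for, but no p-value is ever requested for it.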

ADD REPLY • link 5 months ago by i.sudbery 22k