I've been reading up on how to use ComBat, since I want to apply it to some expression data (not for differential expression analysis), and I've seen mixed information about which variables to include in the model matrix.
The standard call to ComBat in R is:
ComBat(dat=edata, batch=batch, mod=modcombat)
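For concreteness, here is a minimal sketch of what the inputs to that call might look like. The toy data, the dimensions and the intercept-only design are all made up for illustration; only the argument names `dat`, `batch` and `mod` come from the call above.

```r
# Toy inputs (hypothetical): 100 genes x 8 samples, two batches of 4 samples.
set.seed(1)
edata <- matrix(rnorm(100 * 8), nrow = 100,
                dimnames = list(paste0("gene", 1:100), paste0("s", 1:8)))
batch <- factor(rep(c("b1", "b2"), each = 4))

# Intercept-only design: no covariates are declared at all.
modcombat <- model.matrix(~ 1, data = data.frame(batch))

# One batch label and one design row per column (sample) of the data.
stopifnot(ncol(edata) == length(batch), ncol(edata) == nrow(modcombat))

# The call itself, run only if the sva package is installed.
if (requireNamespace("sva", quietly = TRUE)) {
  adjusted <- sva::ComBat(dat = edata, batch = batch, mod = modcombat)
}
```

Note that the batch labels go in through `batch`, separately from whatever is put into `mod`.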
Now consider the situation where you have three variables of interest: treatment, sex and age. You want to keep as much of the variation in the data due to these covariates as possible. Suppose you also have a long list of confounders conf1, conf2 ... confn whose effects on the data would best be removed. From what I understand,
modcombat is the design matrix for the linear model, and it specifies the variables that explain the observed expression AND that we want to "remove". So technically it should be constructed like this:
model.matrix(~ conf1 + conf2 ... + confn)
However, a colleague has told me that it's the other way around, and that the variables included in the design matrix should be the variables of interest that we want to "keep":
model.matrix(~ treatment + sex + age)
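To make the two competing constructions concrete, here is a small sketch that builds both design matrices in base R. The phenotype table is hypothetical: the column names follow the variables above, but only two confounders are included and all values are invented.

```r
# Hypothetical phenotype table for 8 samples; all values are made up.
pheno <- data.frame(
  treatment = factor(rep(c("ctrl", "drug"), times = 4)),
  sex       = factor(rep(c("F", "M"), each = 2, times = 2)),
  age       = c(34, 51, 47, 29, 62, 45, 38, 56),
  conf1     = c(7.1, 8.3, 6.9, 9.0, 7.7, 8.8, 6.5, 9.2),
  conf2     = c(0.2, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3, 0.6)
)

# First reading: the design holds the confounders to "remove".
mod_conf <- model.matrix(~ conf1 + conf2, data = pheno)

# Second reading: the design holds the variables of interest to "keep".
mod_keep <- model.matrix(~ treatment + sex + age, data = pheno)

# Both are ordinary design matrices with one row per sample.
print(colnames(mod_conf))
print(colnames(mod_keep))
```

Both objects have the same shape and could be passed as `mod` without error, which is part of why the intended meaning is not obvious from the call alone.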
Then the SVA package tutorial states that:
Just as with sva, we then need to create a model matrix for the adjustment variables, including the variable of interest
This contradicts both of the approaches above and makes no sense to me, because if you include everything in the model (i.e.
model.matrix(~ treatment + sex + age + conf1 + conf2 ... + confn)), how would ComBat know what you want to remove and what you want to keep? Does anyone know what the correct usage actually is?
EDIT: I've been doing more digging, and there seems to be a systematic problem where people do not know how to actually use SVA or ComBat. Take for example this Bioconductor post where a user asks a similar question about SVA. He even goes on to analyze how strongly the inferred surrogate variables are related to a known confounder (batch) when he runs SVA with and without including the confounder as an "adjustment variable". He finds that the results completely contradict what one would initially understand from the SVA/ComBat tutorial. As is usual with these questions, no one has responded clearly.
I think part of the problem is that the tutorial handles certain concepts loosely, such as "variable of interest" and "adjustment variable". Can adjustment variables be further divided into known but "desired" confounders (e.g. sex and age) and known undesired confounders (e.g. batch, RNA integrity index, etc.)? Wouldn't the "desired confounders" be the same as additional variables of interest? Can there be multiple variables of interest, or does it strictly have to be one? Essentially, I think what is lacking is a clear definition of "variable of interest" and "adjustment variable", along with examples that go beyond the simple case of one variable of interest and no adjustment variables, where none of these doubts come up.