I've been looking into ComBat lately for batch effect correction of my expression data, and I've become curious about one feature of the R implementation in the sva package. According to the tutorial, when using ComBat one must provide the expression matrix to correct and the batch vector; however, there is also the option to provide a matrix of adjustment variables to be considered in the correction. As I understand it, these variables are other factors (in addition to batch) that may be introducing noise into the expression data. The aforementioned tutorial and other examples around the web always assume there are no such variables and just pass an intercept (a column of all ones) as the adjustment argument to the ComBat function. I believe the method's paper also doesn't provide an example in which actual adjustment variables are used.
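To check that I'm describing the adjustment matrix correctly, here is a minimal sketch in plain Python (not the sva API) of what I understand the two cases to look like: an intercept-only matrix, which I believe the tutorial builds in R with something like `model.matrix(~1, data = pheno)`, and a matrix that also carries covariates. The `sex` and `age` values and the helper names `dummy_code` and `design_matrix` are made up for illustration.

```python
def dummy_code(values):
    """Treatment-code a categorical variable.

    The alphabetically first level is the reference; every other level
    gets its own 0/1 column.
    """
    levels = sorted(set(values))
    return [[1.0 if v == level else 0.0 for level in levels[1:]] for v in values]


def design_matrix(n_samples, categorical=(), numeric=()):
    """Build one row per sample: intercept first, then covariate columns."""
    rows = [[1.0] for _ in range(n_samples)]  # intercept: a column of ones
    for values in categorical:
        for row, extra in zip(rows, dummy_code(values)):
            row.extend(extra)
    for values in numeric:
        for row, x in zip(rows, values):
            row.append(float(x))
    return rows


# Hypothetical per-sample annotations for four samples.
sex = ["F", "M", "M", "F"]
age = [34, 51, 47, 29]

# What the examples I have seen pass: no adjustment variables at all.
intercept_only = design_matrix(4)

# What I am asking about: intercept plus actual adjustment variables.
with_covariates = design_matrix(4, categorical=[sex], numeric=[age])

for row in with_covariates:
    print(row)
```

If I have this right, the only difference between the two calls is the extra columns, which is why I'm unsure how to decide what belongs in them.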
My question is: how exactly can we determine whether a certain variable can be used as an adjustment variable during batch effect correction? To determine whether we need batch correction at all, a frequent strategy is to use some technique that helps us visualize the data (such as PCA) and then see whether samples group by the variable of interest (e.g. healthy vs. disease), which is the desirable behavior, or whether there is an observable batch bias in how the samples cluster. Can we use a similar strategy to determine whether a certain measurement would be useful as an adjustment variable? Another question is whether these variables should always be directly associated with the batch (such as the date of the batch or the protocol used for extracting the RNA of a particular batch). Could variables such as RNA quality or the gender/age of the subjects from whom the samples were taken be considered as adjustment variables?
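To make the PCA-style check concrete, here is a rough sketch in plain Python of the kind of test I have in mind: compute the first principal component of a toy expression matrix and ask how much of its variance is explained by grouping the samples by a given variable. Everything here is an assumption for illustration: the data are synthetic, the function names are my own, and the power iteration is just a simple stand-in for a proper PCA routine.

```python
import random


def first_pc_scores(matrix, iterations=200, seed=0):
    """Scores of each sample on PC1, found by power iteration.

    `matrix` is samples x genes. Genes are mean-centered first.
    """
    n_samples, n_genes = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in matrix]

    rng = random.Random(seed)
    direction = [rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
    for _ in range(iterations):
        # Multiply by X^T X, then renormalize.
        scores = [sum(x * d for x, d in zip(row, direction)) for row in centered]
        updated = [
            sum(centered[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_genes)
        ]
        norm = sum(x * x for x in updated) ** 0.5
        direction = [x / norm for x in updated]
    return [sum(x * d for x, d in zip(row, direction)) for row in centered]


def variance_explained(scores, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = sum(scores) / len(scores)
    total = sum((s - grand_mean) ** 2 for s in scores)
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between / total


# Synthetic data: 8 samples x 4 genes, strong batch shift, weak condition shift.
batch = ["A"] * 4 + ["B"] * 4
condition = ["healthy", "disease"] * 4
noise = random.Random(1)
expression = []
for b, c in zip(batch, condition):
    row = [8.0, 8.0, 8.0, 8.0]
    if b == "B":
        row[0] += 5.0
        row[1] += 5.0
    if c == "disease":
        row[2] += 1.0
    expression.append([x + noise.gauss(0.0, 0.1) for x in row])

pc1 = first_pc_scores(expression)
batch_r2 = variance_explained(pc1, batch)
condition_r2 = variance_explained(pc1, condition)
print(f"PC1 variance explained by batch:     {batch_r2:.3f}")
print(f"PC1 variance explained by condition: {condition_r2:.3f}")
```

In this toy case PC1 tracks batch rather than condition, by construction. What I can't tell is whether getting a similar number for, say, RNA quality or age would justify adding that variable to the adjustment matrix.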