Question: correcting for a batch in DESeq2
gravatar for gracechappell
4.9 years ago by
United States
gracechappell70 wrote:


I have the following experimental design for an experiment in which we conducted RNA sequencing: 2 treatment groups and 2 batches, but one of the batches is exclusively one treatment. 

sample    treatment    batch
3    control    A
4    control    A
5    control    B
6    control    B
7    control    B
13  control    B
14  control    B
15   BD    A
16   BD    A
17   BD    A
18   BD    A
18   BD    A

It is a poor experimental design, but unfortunately it is the data that I currently must work with.  To account for my the potential batch effect of the B batch, I am using the following design formula:

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch + treatment)
dds$treatment <- factor(dds$treatment, levels=c("control", "BD"))
dds$batch <- factor(dds$batch, levels=c("A", "B"))
dds <- DESeq(dds, full=design(dds), reduced = ~ batch)

The results give me many less DE genes than if I simply ignored the batches and only use ~ treatment.  This makes sense, because according to PCA and clustering, there is a batch effect in my samples. I've read the DESeq2 manual and many posts, but am not a statistician and would love to hear feedback if the design I'm using here makes sense, with the lack of representation of both treatment groups in the batch I am intended to correct for.  Thank you!

rna-seq R • 8.0k views
ADD COMMENTlink modified 4.9 years ago by Martombo2.6k • written 4.9 years ago by gracechappell70
gravatar for Devon Ryan
4.9 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

Your design is correct, it's not an issue that the batch is just within one treatment since that treatment is itself present in both batch A and B. You only have a big problem when a whole group is present in its own batch.

ADD COMMENTlink written 4.9 years ago by Devon Ryan96k

I think this design would not really perform the kind of comparison one would expect. Have a look at my answer if you wish!

ADD REPLYlink written 4.9 years ago by Martombo2.6k

It produces the exact comparison that one expects, it's just an LRT for the effect of treatment. The only real caveat to any of this is that a batch-specific change in one group will completely throw off the results...but the only way to deal with such things is to either ignore the possibility (what's typically done) or just toss most of the data.

ADD REPLYlink written 4.9 years ago by Devon Ryan96k


I have such design

    condition   batch
A1  treatment   1
A2  treatment   1
A3  treatment   1
A4  treatment   1
A5  treatment   1
A6  treatment   1
A7  control     1
A8  control     1
A9  control     1
A10 control     1
A11 control     1
A12 control     1
B1  treatment   2
B2  treatment   2
B3  treatment   2
B4  treatment   2
B5  treatment   2
B6  treatment   2
B7  control     2
B8  control     2
B9  control     2
B10 control     2
B11 control     2
B12 control     2

I am only interested in knowing genes related to batch (two experiments) ignoring any change due to treatment; Does this give me these genes?

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch + condition)
ADD REPLYlink written 18 months ago by A3.8k

if you want to completely ignore the change due to treatment, then remove it:

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch)
ADD REPLYlink written 5 weeks ago by Ömer An200
gravatar for Martombo
4.9 years ago by
Seville, ES
Martombo2.6k wrote:

I recently had a similar problem and it is actually quite a non-optimal situation. The samples present in the B batch of your case are not going to be completely useful for your analysis. I produced some results with my data, first only comparing control and BD from what in your case would be the A batch (so two samples of the control vs BD) and then comparing all controls vs BD, including the batch variable as a covariate. Here's a peek at these results:

==> A_only <==
gene mean_base log2FC log2FC_MLE FDR
gene1 15001.4 -4.81 -4.89 1.3e-242
gene2 1889.02 1.62 1.64 2.4e-43
gene3 5020.71 0.92 0.92 3.5e-39
gene4 3568.32 0.91 0.92 1.0e-13

==> BA_batch <==
gene mean_base log2FC log2FC_MLE FDR
gene1 15001.47 -4.51 -4.89 6.0e-243
gene2 1889.02 1.56 1.64 1.2e-43
gene3 5020.71 0.85 0.92 2.0e-39
gene4 14800.52 0.86 0.92 1.6e-56


gene4 was modified on purpose, just to be sure of what I'm telling you. So don't consider it for the moment. Have a look instead at gene1, 2 and 3. You can see that the mean_base is the same between the two analyses and so is the unmoderated log2FC estimation (log2FC_MLE). Basically whether you include your control B samples or not, will not affect the log2FC that is being computed by DESeq2, that is because the contrast is performed between your control A samples and your BD samples. It will instead have an effect on the significance of the comparison and on the moderation of the log2FC. For gene4 I applied an additional check. I multiplied by 10 its counts in the samples control B. This means that if these samples were taken into account for the contrast, the log2FC should be very different. That is not the case! Which means that you can have 10-fold differences in your control B data and these will not influence your results.


ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Martombo2.6k

The Log2FC should change with increasing samples, since the ability to shrink with change accordingly. Yes, of course the computed value won't change, since the batch will make the sub-group means the same, but the resulting shrinkage will differ. This is why you want more than 1 sample in each batch, otherwise the results aren't very meaningful.

ADD REPLYlink written 4.9 years ago by Devon Ryan96k

yes, I'm not saying that this is the wrong approach, it's indeed the best possible, probably. I'm just pointing at some of its implications. For example if somebody wanted to invest more money in adding samples from control B, that's probably not a good idea...

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Martombo2.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1278 users visited in the last hour