correcting for a batch in DESeq2
Entering edit mode
8.8 years ago
gracechappell ▴ 110


I have the following experimental design for an experiment in which we conducted RNA sequencing: 2 treatment groups and 2 batches, but one of the batches is exclusively one treatment.

sample  treatment  batch
3       control    A
4       control    A
5       control    B
6       control    B
7       control    B
13      control    B
14      control    B
15      BD         A
16      BD         A
17      BD         A
18      BD         A
18      BD         A

It is a poor experimental design, but unfortunately it is the data that I currently must work with. To account for my the potential batch effect of the B batch, I am using the following design formula:

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch + treatment)
dds$treatment <- factor(dds$treatment, levels=c("control", "BD"))
dds$batch <- factor(dds$batch, levels=c("A", "B"))
dds <- DESeq(dds, full=design(dds), reduced = ~ batch)

The results give me many less DE genes than if I simply ignored the batches and only use ~ treatment. This makes sense, because according to PCA and clustering, there is a batch effect in my samples. I've read the DESeq2 manual and many posts, but am not a statistician and would love to hear feedback if the design I'm using here makes sense, with the lack of representation of both treatment groups in the batch I am intended to correct for. Thank you!

RNA-Seq R • 12k views
Entering edit mode
8.8 years ago

Your design is correct, it's not an issue that the batch is just within one treatment since that treatment is itself present in both batch A and B. You only have a big problem when a whole group is present in its own batch.

Entering edit mode

I think this design would not really perform the kind of comparison one would expect. Have a look at my answer if you wish!

Entering edit mode

It produces the exact comparison that one expects, it's just an LRT for the effect of treatment. The only real caveat to any of this is that a batch-specific change in one group will completely throw off the results...but the only way to deal with such things is to either ignore the possibility (what's typically done) or just toss most of the data.

Entering edit mode


I have such design

    condition   batch
A1  treatment   1
A2  treatment   1
A3  treatment   1
A4  treatment   1
A5  treatment   1
A6  treatment   1
A7  control     1
A8  control     1
A9  control     1
A10 control     1
A11 control     1
A12 control     1
B1  treatment   2
B2  treatment   2
B3  treatment   2
B4  treatment   2
B5  treatment   2
B6  treatment   2
B7  control     2
B8  control     2
B9  control     2
B10 control     2
B11 control     2
B12 control     2

I am only interested in knowing genes related to batch (two experiments) ignoring any change due to treatment; Does this give me these genes?

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch + condition)
Entering edit mode

if you want to completely ignore the change due to treatment, then remove it:

dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ batch)
Entering edit mode
8.8 years ago
Martombo ★ 3.1k

I recently had a similar problem and it is actually quite a non-optimal situation. The samples present in the B batch of your case are not going to be completely useful for your analysis. I produced some results with my data, first only comparing control and BD from what in your case would be the A batch (so two samples of the control vs BD) and then comparing all controls vs BD, including the batch variable as a covariate. Here's a peek at these results:

==> A_only <==
gene mean_base log2FC log2FC_MLE FDR
gene1 15001.4 -4.81 -4.89 1.3e-242
gene2 1889.02 1.62 1.64 2.4e-43
gene3 5020.71 0.92 0.92 3.5e-39
gene4 3568.32 0.91 0.92 1.0e-13

==> BA_batch <==
gene mean_base log2FC log2FC_MLE FDR
gene1 15001.47 -4.51 -4.89 6.0e-243
gene2 1889.02 1.56 1.64 1.2e-43
gene3 5020.71 0.85 0.92 2.0e-39
gene4 14800.52 0.86 0.92 1.6e-56

gene4 was modified on purpose, just to be sure of what I'm telling you. So don't consider it for the moment. Have a look instead at gene1, 2 and 3. You can see that the mean_base is the same between the two analyses and so is the unmoderated log2FC estimation (log2FC_MLE). Basically whether you include your control B samples or not, will not affect the log2FC that is being computed by DESeq2, that is because the contrast is performed between your control A samples and your BD samples. It will instead have an effect on the significance of the comparison and on the moderation of the log2FC. For gene4 I applied an additional check. I multiplied by 10 its counts in the samples control B. This means that if these samples were taken into account for the contrast, the log2FC should be very different. That is not the case! Which means that you can have 10-fold differences in your control B data and these will not influence your results.

Entering edit mode

The Log2FC should change with increasing samples, since the ability to shrink with change accordingly. Yes, of course the computed value won't change, since the batch will make the sub-group means the same, but the resulting shrinkage will differ. This is why you want more than 1 sample in each batch, otherwise the results aren't very meaningful.

Entering edit mode

yes, I'm not saying that this is the wrong approach, it's indeed the best possible, probably. I'm just pointing at some of its implications. For example if somebody wanted to invest more money in adding samples from control B, that's probably not a good idea...


Login before adding your answer.

Traffic: 1411 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6