Batch-correction of partially confounded RNA-seq data
6 weeks ago
James Ashmore ★ 3.1k

Say I have RNA-seq data from 3 conditions (A,B,C) and a single batch (X)

  X
A 2
B 2
C 2


I then decide that I want to include 2 more conditions (D, E) from a separate batch (Y)

  Y
D 2
E 2


Now, I want to calculate differential expression between all pairs of conditions.

My initial impression is that there is no way to salvage this design. In order to account for differences due to batch I would need to have included samples from conditions D/E in batch X and conditions A/B/C in batch Y - is that a reasonable conclusion?

Additionally, if I included samples from conditions A/B/C in batch Y, could I batch-correct the data based on that subset of shared conditions? In that case, it seems the data would be only partially confounded.
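One way to make the two scenarios concrete is to check whether the batch term is estimable, i.e. whether a design matrix with condition and batch columns has full column rank. The following is a minimal sketch in Python with NumPy, assuming a simple additive model (intercept + condition + batch) with treatment coding; the helper `design_matrix` and the sample layouts are illustrative assumptions, not part of any RNA-seq package.

```python
import numpy as np

def design_matrix(samples, conditions, batches):
    """Intercept + condition + batch columns (treatment coding).

    samples is a list of (condition, batch) tuples; the first entry of
    conditions and of batches is used as the reference level.
    """
    cols = [np.ones(len(samples))]
    for cond in conditions[1:]:
        cols.append(np.array([float(s[0] == cond) for s in samples]))
    for bat in batches[1:]:
        cols.append(np.array([float(s[1] == bat) for s in samples]))
    return np.column_stack(cols)

conditions = ["A", "B", "C", "D", "E"]
batches = ["X", "Y"]

# Design from the question: A/B/C only in batch X, D/E only in batch Y.
confounded = ([(c, "X") for c in "ABC" for _ in range(2)]
              + [(c, "Y") for c in "DE" for _ in range(2)])

# Proposed fix: additionally sequence A/B/C in batch Y.
bridged = confounded + [(c, "Y") for c in "ABC" for _ in range(2)]

ranks = {}
for name, samples in [("confounded", confounded), ("bridged", bridged)]:
    X = design_matrix(samples, conditions, batches)
    ranks[name] = (X.shape[1], int(np.linalg.matrix_rank(X)))
    print(f"{name}: {X.shape[1]} columns, rank {ranks[name][1]}")
```

In the first design the batch column equals the sum of the D and E columns, so the matrix is rank-deficient and the batch effect cannot be separated from the D/E effects. Once A/B/C also appear in batch Y, the batch column is no longer a combination of the condition columns and the model becomes estimable.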

6 weeks ago

Yes, having samples in both batches is the only way to correct for a batch effect.

At a bare minimum, you need only one sample sequenced in both batches to correct for the batch effect. More is better, of course, but because we usually assume that a batch effect affects all samples similarly (no batch:sample interaction), it is not statistically required to have all samples in both batches. That said, the more shared data you have, the better the batch-effect correction you can make.
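To illustrate the additive assumption, here is a small simulation for a single gene, written as a sketch in Python with NumPy. All numbers (condition means, the batch shift, the noise level) are made up for illustration, and the layout assumes one extra condition-A sample sequenced in batch Y as the only bridge between batches. In practice the same idea is usually applied by including batch as a term in the design of a differential expression package rather than by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

conditions = ["A", "B", "C", "D", "E"]
# Hypothetical true log2 expression per condition and additive shift for batch Y.
true_mean = {"A": 5.0, "B": 6.0, "C": 7.0, "D": 8.0, "E": 9.0}
true_batch_shift = 2.0

# Original design plus a single bridging sample: condition A in batch Y.
samples = ([(c, "X") for c in "ABC" for _ in range(2)]
           + [(c, "Y") for c in "DE" for _ in range(2)]
           + [("A", "Y")])

cond = np.array([c for c, _ in samples])
batch = np.array([b for _, b in samples])

# Simulated expression: condition mean + batch shift + small noise.
y = np.array([true_mean[c] for c in cond]) + true_batch_shift * (batch == "Y")
y = y + rng.normal(0.0, 0.1, size=len(samples))

# Additive model: intercept (A in X), B-A, C-A, D-A, E-A, batch Y.
X = np.column_stack(
    [np.ones(len(samples))]
    + [(cond == k).astype(float) for k in conditions[1:]]
    + [(batch == "Y").astype(float)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

batch_hat = coef[5]          # estimated batch shift
d_vs_a_adjusted = coef[3]    # D - A after accounting for batch
# Naive comparison that ignores batch: D (all in Y) vs A in X.
d_vs_a_naive = y[cond == "D"].mean() - y[(cond == "A") & (batch == "X")].mean()

print(f"estimated batch shift: {batch_hat:.2f} (simulated {true_batch_shift})")
print(f"D - A, batch-adjusted: {d_vs_a_adjusted:.2f} (simulated 3.0)")
print(f"D - A, ignoring batch: {d_vs_a_naive:.2f}")
```

With only one bridging sample the batch estimate rests entirely on that sample, so its noise propagates into every cross-batch comparison. Adding more shared samples narrows that uncertainty, which is the sense in which more shared data gives a better correction.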

