Question

Batch Effect Removal on non-Linearly independent samples

9 months ago

James • 0

The wet lab that I work with did a bulk RNA-Seq experiment with wild-type (WT) and diseased cells. They then treated some of the WT and some of the diseased cells with an RNA methyltransferase to see if it rescued the diseased state. Here are the specifics of the experiment for each sample: treated vs. untreated, batch, and disease group.

Treatment (T=treated; U=untreated):

T U U U T T T T U U U T

Batch:

1 2 3 4 5 5 5 5 3 4 2 1

Disease Group:

1 1 1 1 1 1 2 2 2 2 2 2

The problem is that all of the treated samples are in batches 1 and 5, while all of the untreated samples are in batches 2, 3, and 4. If I perform batch effect removal using the batches and the disease groups as inputs, wouldn't this cancel out the treatment effect? What should I do in this situation? I wasn't involved in the wet lab part of this experiment and I wasn't consulted on its planning.

Also, if I perform batch removal, can I use all 6 samples from each disease group to compare differentially expressed genes, given that the treatment effect will also have been removed? Ideally, I would like to keep the analysis faithful to what was planned, but any tips, suggestions, or advice would be helpful.
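The confounding described above can be checked directly by cross-tabulating the sample annotations. The following is a minimal sketch using only the Python standard library; the three annotation vectors are copied from the question, while the helper name `crosstab` is just an illustrative choice.

```python
from collections import defaultdict

# Sample annotations copied from the question, in sample order.
treatment = "T U U U T T T T U U U T".split()
batch = [1, 2, 3, 4, 5, 5, 5, 5, 3, 4, 2, 1]
disease = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]


def crosstab(rows, cols):
    """Count samples for each (row level, column level) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for r, c in zip(rows, cols):
        table[r][c] += 1
    return {r: dict(cs) for r, cs in table.items()}


batch_by_treatment = crosstab(batch, treatment)
batch_by_disease = crosstab(batch, disease)

for b in sorted(batch_by_treatment):
    print(f"batch {b}: treatment {batch_by_treatment[b]}, disease {batch_by_disease[b]}")

# A batch is "mixed" only if it contains both treated and untreated samples.
mixed_batches = [b for b, counts in batch_by_treatment.items() if len(counts) > 1]
print("batches containing both T and U:", mixed_batches)
```

No batch contains both treated and untreated samples, so treatment is nested in batch, whereas every batch contains both disease groups.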

batch rna-seq ngs • 606 views

updated 9 months ago by Asaf 10k • written 9 months ago by James • 0

What does "batch" mean in this context? Is it the sequencing batch? The experimental batch? Maybe the lab considers the treatment a different batch where in practice someone else might consider it the same batch. In some cases treatments and controls can't be processed together for technical reasons. I would clear this up before jumping to conclusions.

comment 9 months ago by Asaf 10k

score 0 · Answer 1 · 2023-07-11

So "rescuing the disease state" to me means a difference-in-differences comparison of the form:

(Treated Disease vs Untreated Disease) vs (Untreated Disease vs Untreated Control)

The second part (Untreated Disease vs Untreated Control) lies entirely within batches 2, 3, and 4, each of which contains both groups, so it is OK: you can define the "disease" signature without trouble.

The first part (Treated Disease vs Untreated Disease) is, as you mention, split by batch: the treated samples are in batches 1 and 5 and the untreated samples are in batches 2, 3, and 4. A batch effect correction method won't complain, because the same batches also contain (Treated WT vs Untreated WT). Correcting for batch will therefore remove the average treatment effect across both WT and disease, leaving only any residual (Disease x Treatment) effect.

Unfortunately, treatment is perfectly confounded with batch, and there's very little you can do. You can use the variability among batches 2, 3, and 4 to set a prior on the magnitude of the batch effects for batches 1 and 5. In the case of a strong treatment effect this will reduce the magnitude of the correction, but the correction will still be in the "direction" of treatment. Ultimately there is no way to distinguish between treatment effects and batch effects under this fixed-effects design.
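To make the confound concrete, one can compare the rank of the fixed-effects design matrix with and without the treatment term. The sketch below uses only the Python standard library and exact rational arithmetic. The coding of disease group 2 as "diseased" is an assumption (the question does not say which group is WT), and the helper names `matrix_rank` and `design` are illustrative; the ranks do not depend on that coding.

```python
from fractions import Fraction

# Sample annotations copied from the question, in sample order.
treatment = "T U U U T T T T U U U T".split()
batch = [1, 2, 3, 4, 5, 5, 5, 5, 3, 4, 2, 1]
disease = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]


def matrix_rank(rows):
    """Exact matrix rank via Gaussian elimination on Fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design(include_treatment, include_interaction):
    """Treatment-coded design: intercept, batch dummies (2..5), disease, optional terms."""
    rows = []
    for t, b, d in zip(treatment, batch, disease):
        treated = int(t == "T")
        diseased = int(d == 2)  # assumption: group 2 is the diseased group
        row = [1] + [int(b == level) for level in (2, 3, 4, 5)] + [diseased]
        if include_treatment:
            row.append(treated)
        if include_interaction:
            row.append(treated * diseased)
        rows.append(row)
    return rows


models = {
    "batch + disease": design(False, False),
    "batch + disease + treatment": design(True, False),
    "batch + disease + disease:treatment": design(False, True),
}
for name, rows in models.items():
    print(f"{name}: {len(rows[0])} columns, rank {matrix_rank(rows)}")
```

Adding the treatment main effect adds a column without increasing the rank, so that coefficient is not estimable alongside fixed batch effects. The disease-by-treatment interaction column does increase the rank, which matches the point that only the residual interaction survives batch correction.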

Funnily enough, you can compare (Treated Disease vs Treated Control) vs (Untreated Disease vs Untreated Control) without issue, since both of these comparisons are balanced within batch. This will actually let you make statements about how treatment impacts the differentially expressed genes, but you have to make assumptions about whether it does this by making disease look more like control or by making control look more like disease. You cannot definitively rule out a treatment effect on controls that is then "canceled out" by a similarly large but opposite batch effect. However, if the (disease vs control) effects are large compared to the effects of batches 2, 3, and 4, then you can make a compelling case that "successful treatment" is far more likely than "pernicious batch effect."

One way to do this is to correct for the batch factor via a random effect rather than a fixed effect. At the moment, the practical options are either to switch from DESeq2/edgeR to limma, so you can use the duplicateCorrelation function to treat batch as a random effect, or to take vst/voom/logTPM expression values and fit a linear mixed model using lmer. I would recommend the former.
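The balanced comparison above can be illustrated with a noise-free toy calculation for a single gene. Every numeric value below (baseline, batch effects, disease effect, treatment effect, interaction) is hypothetical and chosen only for illustration, and disease group 2 is assumed to be the diseased group. The sketch shows that within-batch disease-minus-control differences cancel both the batch effect and the treatment main effect, so their difference between treated and untreated batches recovers the interaction.

```python
from statistics import mean

# Sample annotations copied from the question, in sample order.
treatment = "T U U U T T T T U U U T".split()
batch = [1, 2, 3, 4, 5, 5, 5, 5, 3, 4, 2, 1]
disease = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]

# Hypothetical effects for one gene on a log-expression scale.
baseline = 5.0
batch_effect = {1: 0.8, 2: -0.5, 3: 0.3, 4: -0.2, 5: 1.5}
disease_effect = 2.0     # disease vs control in untreated cells
treatment_effect = 0.7   # treatment main effect (confounded with batch)
interaction = -1.5       # extra change in diseased cells when treated


def expression(t, b, d):
    """Noise-free expression value for one sample under the toy model."""
    treated = t == "T"
    diseased = d == 2  # assumption: group 2 is the diseased group
    return (baseline + batch_effect[b]
            + disease_effect * diseased
            + treatment_effect * treated
            + interaction * (treated and diseased))


values = [expression(t, b, d) for t, b, d in zip(treatment, batch, disease)]


def disease_minus_control(b):
    """Within-batch difference; the batch and treatment terms cancel."""
    dis = [v for v, bb, d in zip(values, batch, disease) if bb == b and d == 2]
    ctl = [v for v, bb, d in zip(values, batch, disease) if bb == b and d == 1]
    return mean(dis) - mean(ctl)


treated_batches = sorted({b for t, b in zip(treatment, batch) if t == "T"})
untreated_batches = sorted({b for t, b in zip(treatment, batch) if t == "U"})

diff_treated = mean([disease_minus_control(b) for b in treated_batches])
diff_untreated = mean([disease_minus_control(b) for b in untreated_batches])
estimate = diff_treated - diff_untreated

print("disease - control, treated batches:  ", round(diff_treated, 6))
print("disease - control, untreated batches:", round(diff_untreated, 6))
print("interaction estimate:                ", round(estimate, 6))
```

In real data each sample also carries noise, so the estimate would only approximate the interaction, and the treatment main effect itself remains unidentifiable from batch.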
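For reference, the random-effect formulation can be written out as follows. This is a sketch of a standard random-intercept model for gene g and sample i, not something stated in the original answer; the symbols are introduced here for illustration.

```latex
y_{gi} = x_i^{\top} \beta_g + b_{g,\mathrm{batch}(i)} + \varepsilon_{gi},
\qquad
b_{g,k} \sim \mathcal{N}\!\left(0, \sigma^2_{b,g}\right),
\qquad
\varepsilon_{gi} \sim \mathcal{N}\!\left(0, \sigma^2_{g}\right)

\rho_g = \operatorname{Corr}\!\left(y_{gi}, y_{gj}\right)
       = \frac{\sigma^2_{b,g}}{\sigma^2_{b,g} + \sigma^2_{g}}
\quad \text{for } i \neq j \text{ in the same batch}
```

Here x_i holds the fixed effects (disease, treatment, and their interaction) and the batch term is a random draw rather than a free coefficient, so the treatment coefficient becomes estimable, at the cost of assuming that batch effects come from a common distribution. limma's duplicateCorrelation estimates a single consensus within-block correlation across genes, which is then supplied to lmFit together with the blocking factor.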