Question

Normalization and batch effects correction in RNA-Seq data

1

3.6 years ago

elb ▴ 250

Hi guys, I have a simple question. I have RNA-Seq data from different batches. Following suggestions from many online posts, I pre-normalized my data (using TMM from edgeR), then corrected it with ComBat, and then re-normalized it (for library size) using DESeq2. My question is: is the second normalization after ComBat correct? Or, at least, is it not dramatically incorrect?

Thank you in advance

Best

RNA-Seq Combat • 4.7k views

ADD COMMENT • link updated 3.6 years ago by rpolicastro 13k • written 3.6 years ago by elb ▴ 250

score 4 · Accepted Answer · 2020-09-04

A quick note, since this is a common problem: for batch correction you generally need to have multiple conditions per batch. If all your WT samples are in one batch and all your KD samples are in another batch, condition and batch are completely confounded and you can't correct for it (as an example).
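
As a rough illustration of that point, here is a minimal base R sketch that checks whether batch and condition are confounded before any modeling. The two sample sheets and the helper `is_full_rank` are hypothetical, made up for this example; the idea is only that a confounded design yields a rank-deficient model matrix.

```r
# Hypothetical sample sheets; names and values are illustrative only.
confounded <- data.frame(
  condition = factor(c("WT", "WT", "WT", "KO", "KO", "KO")),
  batch     = factor(c("batch_1", "batch_1", "batch_1",
                       "batch_2", "batch_2", "batch_2"))
)
balanced <- data.frame(
  condition = factor(c("WT", "WT", "WT", "KO", "KO", "KO")),
  batch     = factor(c("batch_1", "batch_2", "batch_2",
                       "batch_1", "batch_2", "batch_2"))
)

# Cross-tabulate condition against batch; empty cells (zeros) are a warning sign.
print(table(confounded$condition, confounded$batch))
print(table(balanced$condition, balanced$batch))

# The batch and condition effects can only be separated if the model
# matrix has full column rank.
is_full_rank <- function(df) {
  mm <- model.matrix(~ batch + condition, data = df)
  qr(mm)$rank == ncol(mm)
}

is_full_rank(confounded)  # FALSE: batch and condition cannot be separated
is_full_rank(balanced)    # TRUE: both effects are estimable
```

If the check fails, no downstream tool can disentangle the two effects, whichever correction method is used.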

That being said, you can usually add batch as a covariate to the regression formula in edgeR and DESeq2, which is the simpler and more robust option. Your study design would look like the following example for DESeq2:

> df
     condition   batch
WT-1        WT batch_1
WT-2        WT batch_2
WT-3        WT batch_2
KO-1        KO batch_1
KO-2        KO batch_2
KO-3        KO batch_2

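Your regression formula would then be ~ condition + batch, which means your differential expression results for condition will be adjusted for batch. Note that DESeq2's results() reports the last variable in the formula by default, so either request the condition comparison explicitly or write the design as ~ batch + condition.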
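
A minimal sketch of how this design could be set up in R, assuming the sample sheet from the example above. The design matrix part uses only base R. The helper `run_deseq2` and its `counts` argument are hypothetical names introduced here; the helper is defined but not run, since it needs the DESeq2 package and a real count matrix.

```r
# Sample sheet mirroring the example above; WT is set as the reference level.
df <- data.frame(
  condition = factor(c("WT", "WT", "WT", "KO", "KO", "KO"),
                     levels = c("WT", "KO")),
  batch     = factor(c("batch_1", "batch_2", "batch_2",
                       "batch_1", "batch_2", "batch_2")),
  row.names = c("WT-1", "WT-2", "WT-3", "KO-1", "KO-2", "KO-3")
)

# Design matrix for ~ condition + batch: one column per estimated coefficient.
design <- model.matrix(~ condition + batch, data = df)
print(colnames(design))

# Hypothetical helper (not run here): `counts` is assumed to be a raw integer
# count matrix with genes in rows and columns ordered like rownames(df).
run_deseq2 <- function(counts, df) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData   = df,
    design    = ~ condition + batch
  )
  dds <- DESeq2::DESeq(dds)
  # Ask for the condition effect explicitly; by default results() would
  # report the last variable in the design, which is batch here.
  DESeq2::results(dds, contrast = c("condition", "KO", "WT"))
}
```

DESeq2 expects unnormalized counts as input, so in this sketch no TMM or ComBat step comes before the call; the batch term in the design does the adjustment during model fitting.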

score 3 · Accepted Answer · 2020-09-04

3

3.6 years ago

Kevin Blighe 87k

Hey,

Sorry, this is not recommended (by me, and others):

"then I have corrected them using Combat"

The way in which you have implemented this batch-correction procedure is not ideal either, irrespective of the use of ComBat, because you are normalising your data twice, and with 2 different programs.

Could you please read a recent answer that I gave, here: https://support.bioconductor.org/p/133222/#133225

Kevin

ADD COMMENT • link 3.6 years ago by Kevin Blighe 87k

0

Dear Kevin, the experimental design in the question you redirected me to is a little bit different from my case, because I don't have a nested design. In any case, people generally suggest pre-normalizing the data to remove some high-level variability and then performing batch correction. I agree with you about the way to correct, i.e. basically using the batch as a covariate. My question is whether the normalization after the correction, which is basically a ratio of each gene's counts to the library size of each sample, is wrong, or whether it can be expected not to dramatically affect the identification of genes that vary across conditions (i.e. DEGs). Thanks a lot for your help!

ADD REPLY • link 3.6 years ago by elb ▴ 250

0

You have accepted that answer from rpolicastro; so, I will assume that the problem has been addressed and avoid responding further.

ADD REPLY • link 3.6 years ago by Kevin Blighe 87k