Batch correction in DESeq2
1
3
Entering edit mode
2.8 years ago
Arindam Ghosh ▴ 450

For RNA-seq data analysis using DESeq2, a recommended method for batch effect removal is to introduce the batch in the design of the experiment as design = ~ batch + condition.

The presence of batch was already known from experiment design and also detected by PCA biplot on the log transformed raw counts. Post DESeq2 how can we assure that the batches have been correctly taken care of? PCA plot on counts(dds, normalized=TRUE) did not how any changes.

Correcting the batch using ComBat brings out the desired grouping, however, I understand the output from Combat should not be used with DESeq2. Is there any other suggested tool for differential gene expression analysis post batch correction using ComBat?

DESeq2 Batch Effect Combat • 14k views
17
Entering edit mode
2.8 years ago

You could include batch in the design formula, as you have done, and this will then be 'accounted for' (the statistical inferences will be adjusted for batch) when you perform a differential expression analysis testing for condition. batch is essentially treated as a covariate in the regression model, just as we do in association studies for sex, smoking status, BMI, etc.

If you then want to directly remove the batch effect for downstream analyses, you would do it on the variance-stabilised or rlog expression values using limma::removeBatchEffect(). There was a recent question on Bioconductor: https://support.bioconductor.org/p/125386/#125387 (follow the link to the DESeq2 vignette)

1
Entering edit mode

Hello Kevin,

I am one of the confused ones regarding removing batch effects. How is 'accounted for' different from 'removal'? When is it correct to use the 'accounted for' (i.e. using the ~batch+cond formula) approach and when is it correct for removal of the batch effect? Thank you for shedding light on these dark areas.

7
Entering edit mode

Hello jfo, when we 'account for' the batch effect, batch is being treated as a covariate / confounder in the statistical model. In regression, when we want to adjust for confounders, say smoking status, we usually include them like this in our design formula:

~ gene + smokingStatus


Then, when we derive our test statistics for gene, we known that these test statistics will have been adjusted for the effect of smoking status. The same logic applies when we are adjusting for batch and when we are designing the formula for DESeq2, Limma, EdgeR, etc.

However, in none of this is the underlying data modified. We are merely modifying / adjusting our statistical inferences based on smokingStatus, batch, etc.

So, accounting for batch in this way is only really plausible when we are performing a differential expression analysis, in the case of expression studies.

## -----------------

The above is all well and good until we actually want to take our data and use it downstream. It would make no sense to perform 'machine learning' stuff if we knew that there was a batch effect in the data. The downstream program may have a way to account for batch just like the above, but usually not. In this scenario, we have to actually aim to modify our underlying data. This usually involves estimating the effect of batch and then directly adjusting the underlying data so that the batch effect is effectively wiped away.

1
Entering edit mode

Thanks @Kevin, I do like your way of clarifying issues! I appreciate if you answer my question. I don't want to do differential expression analysis, I just want to normalize my RNAseq data and then do some clustering and classification .. Based on your explanation I realized after adding the batch in the design formula, I also have to remove the batch effect, right?

6
Entering edit mode

after adding the batch in the design formula, I also have to remove the batch effect, right?

Yes

here is a code that I tried for clustering on DESeq2 normalized counts:

dds <- DESeqDataSetFromMatrix(countData=counts, colData=factors, design = ~ Batch + Covariate)
dds <- DESeq(dds)
vsd <- vst(dds, blind=FALSE)
mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, vsd\$Batch)
assay(vsd) <- mat
counts_batch_corrected <- assay(vsd)

0
Entering edit mode

I have a similar question, I have read many posts on this subject and tried analyzing my data but got an error that I don't fully understand. I would greatly appreciate your help in clarifying what I am doing wrong.

I have RNA-seq data across different tissues, and I want to identify genes that are uniquely expressed in a tissue of interest. Some of these data were generated in the lab and sequenced in the same lane, and other data comes from public databases. Can I use DESeq2 normalization in this case? My understanding is that I can and I tried the following: type= batch

> head(expdesign)
condition  type
heart_1            heart     A
heart_2            heart     A
heart_3            heart     A
liver_1           liver      B
liver_2           liver      B
liver_3           liver      B
> dds <- DESeqDataSetFromMatrix(countData = counts,
+                               colData = expdesign,
+                               design = ~type + condition)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.

vignette('DESeq2')

2
Entering edit mode

Hello again. Based on the data that you have, these 2 designs are exactly the same:

• design = ~ condition
• design = ~ type

In this case, you can only choose one or the other.

If you are interested in uniquely expressed genes, a differential expression analysis does not directly test for that. You may try, in addition, something more simple, such as calculating gene Z-scores in each tissue.

0
Entering edit mode

Hi Kevin, Thank you very much. The idea was to use DESeq2 normalization and extract the normalized counts for downstream analysis. To identify tissue-specific genes(lung transcriptome for example), I was planning to do a correlation using the normalized counts. Is there a reference for calculating gene Z-scores in each tissue, and does this require batch correction? Thanks again!

1
Entering edit mode

Hi!

I was planning to do a correlation using the normalized counts.

What do you mean?

## -------------

Is there a reference for calculating gene Z-scores in each tissue, and does this require batch correction?

Z-scores are used generally for this purpose - these are intuitive to use as they represent standard deviations above or below the mean value.

Perhaps we first need to define what is truly "unique"? Is a unique gene one whose expression is literally 0 (zero) in one tissue but not another?; is a unique gene one whose fold-change difference in one tissue over another passes a pre-specified threshold, and that also has a statistically significant adjusted p-value?

So, perhaps first define unique?

0
Entering edit mode

Hi Kevin, You are right what I want to show is actually heat maps of Z-scores comparing expression levels across tissues similar to Figure 4 of Brown et al., 2014. I am still unsure of how I should do this analysis using data generated in two different labs.

https://www.nature.com/articles/nature12962 Thank you so much for your help!!

1
Entering edit mode

Your heart and liver samples are from different studies?

0
Entering edit mode

Yes, I have data generated in the lab(heart) and public data(liver). I also have other data that were generated in my lab but in different batches.

1
Entering edit mode

In that case, there's not much that you can do, in terms of batch correction. The best way forward is to just independently transform each dataset to Z-scores, and then compare that way.

0
Entering edit mode

Okay, thanks. Is there a need to normalize(B) or just transform the dataset to Z-scores(A)?

A)
1. Obtain raw counts
2. Transform to z-score
B)
1. Obtain raw counts
2. Normalize( e.g. DESeq2, RPKM, TMP)
3. Transform to z-score


Thanks!!

2
Entering edit mode

You definitely need to first normalise the data if it is raw counts. Then transform it, and then a final transformation to Z-scores.

Possibilities include:

## Method 1

1. Raw counts
2. Normalise
3. log(CPM + 0.1) (EdgeR) / VST or rlog (DESeq2)
4. Z-score

## Method 2

1. Raw counts
2. FPKM / RPKM
3. Z-score via zFPKM
0
Entering edit mode

I'm confused about Method 1.step 2. DESeq2 will only take raw counts what kind of normalization can I apply in step 2 that will be compatible with DESeq2?

1
Entering edit mode

DESeq2 normalises the raw counts for you via the DESeq() function.

0
Entering edit mode

Right, that is step 3 why do I need step 2 in method 1.

2
Entering edit mode

In fact, that is not step 3.

There is a Step 2, where, in the case of DESeq2, the raw counts are normalised for library size and where dispersion is modelled. It is on these normalised counts that the statistical inferences are made, i.e., the differential expression analysis.

As the normalised counts are still measured on a negative binomial distribution, they are not amenable for the majority of downstream analyses, including clustering, PCA, etc.; so, this is why we then employ a further transformation, Step 3, to bring these normalised counts to a normal distribution.

1
Entering edit mode

I see. Thank you so much for taking the time to explain this so well!!

0
Entering edit mode

In step 2 before normalising with DESeq2, would it help also to filter out the low-read-count genes? Something like:

dds = dds[rowSums(counts(dds)) > (2 * n_samples), ]

0
Entering edit mode

Does that mean that we cannot perform differential expression analysis?