Question: Batch correction in DESeq2
2
gravatar for Arindam Ghosh
16 months ago by
Arindam Ghosh340
Finland
Arindam Ghosh340 wrote:

For RNA-seq data analysis using DESeq2, a recommended method for batch effect removal is to introduce the batch in the design of the experiment as design = ~ batch + condition.

The presence of batch was already known from experiment design and also detected by PCA biplot on the log transformed raw counts. Post DESeq2 how can we assure that the batches have been correctly taken care of? PCA plot on counts(dds, normalized=TRUE) did not how any changes.

Correcting the batch using ComBat brings out the desired grouping, however, I understand the output from Combat should not be used with DESeq2. Is there any other suggested tool for differential gene expression analysis post batch correction using ComBat?

batch effect combat deseq2 • 3.8k views
ADD COMMENTlink modified 3 months ago by Kevin Blighe70k • written 16 months ago by Arindam Ghosh340
11
gravatar for Kevin Blighe
16 months ago by
Kevin Blighe70k
Republic of Ireland
Kevin Blighe70k wrote:

You could include batch in the design formula, as you have done, and this will then be 'accounted for' (the statistical inferences will be adjusted for batch) when you perform a differential expression analysis testing for condition. batch is essentially treated as a covariate in the regression model, just as we do in association studies for sex, smoking status, BMI, etc.

If you then want to directly remove the batch effect for downstream analyes, you would do it on the variance-stabilised or rlog expression values using limma::removeBatchEffect(). There was a recent question on Bioconductor: https://support.bioconductor.org/p/125386/#125387 (follow the link to the DESeq2 vignette)

ADD COMMENTlink modified 16 months ago • written 16 months ago by Kevin Blighe70k

Hello Kevin,

I am one of the confused ones regarding removing batch effects. How is 'accounted for' different from 'removal'? When is it correct to use the 'accounted for' (i.e. using the ~batch+cond formula) approach and when is it correct for removal of the batch effect? Thank you for shedding light on these dark areas.

ADD REPLYlink modified 8 months ago • written 8 months ago by jfo40
4

Hello jfo, when we 'account for' the batch effect, batch is being treated as a covariate / confounder in the statistical model. In regression, when we want to adjust for confounders, say smoking status, we usually include them like this in our design formula:

~ gene + smokingStatus

Then, when we derive our test statistics for gene, we known that these test statistics will have been adjusted for the effect of smoking status. The same logic applies when we are adjusting for batch and when we are designing the formula for DESeq2, Limma, EdgeR, etc.

However, in none of this is the underlying data modified. We are merely modifying / adjusting our statistical inferences based on smokingStatus, batch, etc.

So, accounting for batch in this way is only really plausible when we are performing a differential expression analysis, in the case of expression studies.

-----------------

The above is all well and good until we actually want to take our data and use it downstream. It would make no sense to perform 'machine learning' stuff if we knew that there was a batch effect in the data. The downstream program may have a way to account for batch just like the above, but usually not. In this scenario, we have to actually aim to modify our underlying data. This usually involves estimating the effect of batch and then directly adjusting the underlying data so that the batch effect is effectively wiped away.

ADD REPLYlink written 8 months ago by Kevin Blighe70k
1

Thanks @Kevin, I do like your way of clarifying issues! I appreciate if you answer my question. I don't want to do differential expression analysis, I just want to normalize my RNAseq data and then do some clustering and classification .. Based on your explanation I realized after adding the batch in the design formula, I also have to remove the batch effect, right?

ADD REPLYlink written 7 months ago by Rahil180
3

after adding the batch in the design formula, I also have to remove the batch effect, right?

Yes

here is a code that I tried for clustering on DESeq2 normalized counts:

dds <- DESeqDataSetFromMatrix(countData=counts, colData=factors, design = ~ Batch + Covariate)
dds <- DESeq(dds)
vsd <- vst(dds, blind=FALSE)
mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, vsd$Batch)
assay(vsd) <- mat
counts_batch_corrected <- assay(vsd)
ADD REPLYlink modified 7 months ago • written 7 months ago by Arindam Ghosh340

I have a similar question, I have read many posts on this subject and tried analyzing my data but got an error that I don't fully understand. I would greatly appreciate your help in clarifying what I am doing wrong.

I have RNA-seq data across different tissues, and I want to identify genes that are uniquely expressed in a tissue of interest. Some of these data were generated in the lab and sequenced in the same lane, and other data comes from public databases. Can I use DESeq2 normalization in this case? My understanding is that I can and I tried the following: type= batch

> head(expdesign)
                 condition  type
heart_1            heart     A
heart_2            heart     A
heart_3            heart     A
liver_1           liver      B
liver_2           liver      B
liver_3           liver      B
> dds <- DESeqDataSetFromMatrix(countData = counts,
+                               colData = expdesign,
+                               design = ~type + condition)
Error in checkFullRank(modelMatrix) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')
ADD REPLYlink modified 3 months ago • written 3 months ago by DataFanatic300
2

Hello again. Based on the data that you have, these 2 designs are exactly the same:

  • design = ~ condition
  • design = ~ type

In this case, you can only choose one or the other.

If you are interested in uniquely expressed genes, a differential expression analysis does not directly test for that. You may try, in addition, something more simple, such as calculating gene Z-scores in each tissue.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe70k

Hi Kevin, Thank you very much. The idea was to use DESeq2 normalization and extract the normalized counts for downstream analysis. To identify tissue-specific genes(lung transcriptome for example), I was planning to do a correlation using the normalized counts. Is there a reference for calculating gene Z-scores in each tissue, and does this require batch correction? Thanks again!

ADD REPLYlink modified 3 months ago • written 3 months ago by DataFanatic300
1

Hi!

I was planning to do a correlation using the normalized counts.

What do you mean?

-------------

Is there a reference for calculating gene Z-scores in each tissue, and does this require batch correction?

Z-scores are used generally for this purpose - these are intuitive to use as they represent standard deviations above or below the mean value.

Perhaps we first need to define what is truly "unique"? Is a unique gene one whose expression is literally 0 (zero) in one tissue but not another?; is a unique gene one whose fold-change difference in one tissue over another passes a pre-specified threshold, and that also has a statistically significant adjusted p-value?

So, perhaps first define unique?

ADD REPLYlink written 3 months ago by Kevin Blighe70k

Hi Kevin, You are right what I want to show is actually heat maps of Z-scores comparing expression levels across tissues similar to Figure 4 of Brown et al., 2014. I am still unsure of how I should do this analysis using data generated in two different labs.

https://www.nature.com/articles/nature12962 Thank you so much for your help!!

ADD REPLYlink modified 3 months ago • written 3 months ago by DataFanatic300
1

Your heart and liver samples are from different studies?

ADD REPLYlink written 3 months ago by Kevin Blighe70k

Yes, I have data generated in the lab(heart) and public data(liver). I also have other data that were generated in my lab but in different batches.

ADD REPLYlink modified 3 months ago • written 3 months ago by DataFanatic300
1

In that case, there's not much that you can do, in terms of batch correction. The best way forward is to just independently transform each dataset to Z-scores, and then compare that way.

ADD REPLYlink written 3 months ago by Kevin Blighe70k

Okay, thanks. Is there a need to normalize(B) or just transform the dataset to Z-scores(A)?

A)
1. Obtain raw counts
2. Transform to z-score 
B)
1. Obtain raw counts
2. Normalize( e.g. DESeq2, RPKM, TMP)
3. Transform to z-score

Thanks!!

ADD REPLYlink modified 3 months ago by Kevin Blighe70k • written 3 months ago by DataFanatic300
1

You definitely need to first normalise the data if it is raw counts. Then transform it, and then a final transformation to Z-scores.

Possibilities include:

Method 1

  1. Raw counts
  2. Normalise
  3. log(CPM + 0.1) (EdgeR) / VST or rlog (DESeq2)
  4. Z-score

Method 2

  1. Raw counts
  2. FPKM / RPKM
  3. Z-score via zFPKM
ADD REPLYlink written 3 months ago by Kevin Blighe70k

I'm confused about Method 1.step 2. DESeq2 will only take raw counts what kind of normalization can I apply in step 2 that will be compatible with DESeq2?

ADD REPLYlink written 3 months ago by DataFanatic300
1

DESeq2 normalises the raw counts for you via the DESeq() function.

ADD REPLYlink written 3 months ago by Kevin Blighe70k

Right, that is step 3 why do I need step 2 in method 1.

ADD REPLYlink written 3 months ago by DataFanatic300
1

In fact, that is not step 3.

There is a Step 2, where, in the case of DESeq2, the raw counts are normalised for library size and where dispersion is modelled. It is on these normalised counts that the statistical inferences are made, i.e., the differential expression analysis.

As the normalised counts are still measured on a negative binomial distribution, they are not amenable for the majority of downstream analyses, including clustering, PCA, etc.; so, this is why we then employ a further transformation, Step 3, to bring these normalised counts to a normal distribution.

ADD REPLYlink written 3 months ago by Kevin Blighe70k
1

I see. Thank you so much for taking the time to explain this so well!!

ADD REPLYlink written 3 months ago by DataFanatic300

In step 2 before normalising with DESeq2, would it help also to filter out the low-read-count genes? Something like:

dds = dds[rowSums(counts(dds)) > (2 * n_samples), ]
ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Ömer An230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2221 users visited in the last hour
_