Hi, I’m pretty new to RNA-Seq analysis, and a couple of questions came up while I was working through the individual analysis steps. First, some background on my experiment:

- mouse experiment with two conditions
- 6 biological replicates in each group
- I’m looking for DE genes and, later on, pathways
- sample and library preparation introduce 4 potential batch effects
- one of these effects is collinear (completely confounded) with my condition, so I excluded it
- that leaves three batch effects to consider
- my starting point is a count matrix based on exon counts
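
The confounding mentioned in the list above can be checked mechanically: a batch factor that splits the samples exactly like the condition makes the design matrix rank-deficient, so the two effects cannot be separated. Below is a minimal sketch of such a check. The sample layout and the helper names (`one_hot`, `design_matrix`, `is_full_rank`) are hypothetical and only serve as an illustration; they are not taken from my actual data or from any package.

```python
import numpy as np

def one_hot(labels):
    """Dummy-code a factor, dropping the first (sorted) level as reference."""
    levels = sorted(set(labels))[1:]
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels]
                     for lab in labels]).reshape(len(labels), len(levels))

def design_matrix(*factors):
    """Intercept column plus dummy columns for every factor."""
    n = len(factors[0])
    return np.hstack([np.ones((n, 1))] + [one_hot(f) for f in factors])

def is_full_rank(X):
    """True if all coefficients of the design are estimable."""
    return np.linalg.matrix_rank(X) == X.shape[1]

# Hypothetical layout: 6 control and 6 treated samples.
condition = ["ctrl"] * 6 + ["treat"] * 6
batch_confounded = ["day1"] * 6 + ["day2"] * 6   # same split as condition
batch_crossed = ["A", "B", "C"] * 4              # every batch in both groups

print(is_full_rank(design_matrix(condition, batch_confounded)))  # False
print(is_full_rank(design_matrix(condition, batch_crossed)))     # True
```

A design that fails this check cannot estimate the condition effect and the batch effect separately, which is why I dropped that factor.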

I have a question regarding my assumed batch effects. When I run a PCA and plot the results colored by each of the batch factors, I see no real clustering by any of them. With the raw data I also see no clustering by condition. Interestingly, the DE results differ substantially depending on whether I correct for batch effects or not.

Additionally, I ran a batch correction with limma and with the ComBat function from sva, starting from the raw data. The goal was to check which batch effect, or which combination of them, gives the best result in terms of separating the two conditions. I found that, as in the DE analysis, correcting for all three batches works best.

As mentioned above, I see no clustering by batch in my raw data, so I want to ask whether it is valid to correct for batch effects that I cannot see in a PCA plot. Could I be damaging my data by correcting for them? Is it valid to create some artificial batch effects and then run the analysis with those as a check? Are there other methods to test whether my batch effects are real?

Furthermore, I want to try the svaseq function from sva, but I have not done so yet.

Please let me know if you need more information.

Thanks in advance for your help.

First of all, sorry for the late reply. I assume batch effects because working with such tiny amounts of material (e.g. for library preparation) can always lead to variation when the lab work is done, for example, on different days. What I asked myself is how a tool such as DESeq2 behaves when these effects are added to the model (design) but do not actually exist. Do the functions account for this? In other words, if I include them in the design and the effects are not present, would the result be similar to the plain model (with only the condition of interest)? As mentioned above, I found a substantial difference between these two models (designs). I would guess that something like overfitting is possible, but of course not desirable.

It all depends. As you said in your original post, you weren't seeing a strong batch effect in the PCA of your data. I'm not entirely clear from your response what you did and what you saw. Did you model batch effects in DESeq2, compare that to not accounting for batch effects, and see a difference in the results (presumably differential expression)?

I haven't typically used DESeq2, so I can't comment on it directly. If you try Ballgown, you can easily add a factor for the batches to your samples and include it as a potential confounder. The problem, though, is typically that if you have a limited number of samples, such that samples vary in treatment condition and batch at the same time, your statistical power drops like a rock when you try to capture the batch effects in the data.
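
To make the power argument a bit more concrete, here is a rough sketch using an ordinary linear model as a stand-in. DESeq2 and similar tools fit negative binomial GLMs, so treat this as an analogy, not as what those packages compute. The sample layouts and the function names (`design`, `se_factor`, `residual_df`) are made up for illustration. The quantity computed is the standard error of the condition coefficient in units of the residual standard deviation, which depends only on the design.

```python
import numpy as np

def design(condition, batch=None):
    """Intercept, condition, and optionally one two-level batch column."""
    cols = [np.ones(len(condition)), np.asarray(condition, dtype=float)]
    if batch is not None:
        cols.append(np.asarray(batch, dtype=float))
    return np.column_stack(cols)

def se_factor(X, col=1):
    """sqrt of the (col, col) entry of (X'X)^-1: SE of that coefficient per unit residual sd."""
    return float(np.sqrt(np.linalg.inv(X.T @ X)[col, col]))

def residual_df(X):
    """Degrees of freedom left for estimating the residual variance."""
    return X.shape[0] - np.linalg.matrix_rank(X)

# Hypothetical layout: 6 control (0) and 6 treated (1) samples.
condition = np.array([0] * 6 + [1] * 6)
batch_balanced = np.array([0, 0, 0, 1, 1, 1] * 2)                  # 3 + 3 per group
batch_partial = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])     # 5 + 1 per group

designs = {
    "condition only": design(condition),
    "plus balanced batch": design(condition, batch_balanced),
    "plus partially confounded batch": design(condition, batch_partial),
}
for name, X in designs.items():
    print(name, round(se_factor(X), 3), residual_df(X))
```

In this toy layout a balanced batch factor costs one residual degree of freedom but leaves the standard error of the condition effect unchanged, while the partially confounded batch inflates it by roughly a third. With three batch factors and only 12 samples, both costs add up quickly.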

To be honest, while we should be accounting for potential batch effects in our data, we often don't, because we just don't have the money to do the sampling properly up front. We frequently gather experimental data and replicates on different days and mostly ignore it. Checking, as you did, with a PCA to see whether batches explain a large share of the variance or cluster samples together is a good way to judge whether you can safely assume that ignoring the batches will be "OK".
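
If you want a number rather than an eyeball judgment from the PCA plot, one option is to ask how much of the variance along a principal component is explained by the batch labels, and to compare that against shuffled labels. The sketch below does this on simulated data with a batch shift deliberately built in; the data, the effect size, and the function names (`pca_scores`, `r_squared`, `permutation_p`) are all assumptions for illustration, and real input would be log-transformed normalized counts.

```python
import numpy as np

def pca_scores(log_expr, n_components=2):
    """PCA via SVD on a samples x genes matrix; returns scores and variance shares."""
    centered = log_expr - log_expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    share = s ** 2 / np.sum(s ** 2)
    return u[:, :n_components] * s[:n_components], share[:n_components]

def r_squared(values, labels):
    """Share of variance in `values` explained by group means (one-way ANOVA R^2)."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    total = np.sum((values - values.mean()) ** 2)
    within = sum(np.sum((values[labels == g] - values[labels == g].mean()) ** 2)
                 for g in np.unique(labels))
    return 1.0 - within / total

def permutation_p(values, labels, n_perm=200, seed=1):
    """Fraction of label shuffles with R^2 at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = r_squared(values, labels)
    hits = sum(r_squared(values, rng.permutation(labels)) >= observed
               for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)

# Simulated example: 12 samples, 500 genes, batch "B" shifted in 100 genes.
rng = np.random.default_rng(0)
batch = np.array(["A", "B"] * 6)
expr = rng.normal(size=(12, 500))
expr[batch == "B", :100] += 3.0

scores, share = pca_scores(expr)
r2_batch = r_squared(scores[:, 0], batch)
p_batch = permutation_p(scores[:, 0], batch)
print(round(r2_batch, 2), p_batch)
```

Running the same calculation for each batch factor and each of the first few components gives a small table that is easier to discuss than a set of plots. A low R^2 that looks like the shuffled labels supports leaving that factor out.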

Hi Dan, thanks for your fast answer. Yes, exactly: I found a huge difference in differential expression between the raw model (only diet included) and the so-called fully adjusted model, in which I added three potential batch factors to the design.

It makes a lot of sense that the power drops as you described. I had never thought about that before.

However, I tried another procedure to validate my potential batch effects. What would you say about this procedure, also in terms of loss of statistical power? (See my new post.)