I found your post a little late, but it was so interesting that I will give it a late answer. In your description of the experiment you do not mention the experimental design, i.e. whether the batches contained the same proportion of samples from each of your biological groups. Based on your observations I am pretty sure it was quite unbalanced.
A couple of years ago I went through the same experience as you. I had a severely unbalanced data set with batch effects, which I tried to salvage using ComBat in what seemed like the recommended way. I thought my results were too good to be true. So, like you, I did several sanity checks with permuted labels and random numbers. The results were equally good clusterings of the "groups" and long lists of differentially expressed genes. After some more reading and head-scratching I could not figure out what I did wrong, so I asked for help on the Bioconductor list:
Is ComBat introducing a signal where there is none?
However, I was unable to attract any responses. Anyway, batch effects were such a common problem at my work that we just had to figure this out. With the help of experienced and more mathematically skilled colleagues, we looked further into this and wrote a paper:
"Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses"
Our advice is:
For an investigator facing an unbalanced data set with batch effects, our primary advice would be to account for batch in the statistical analysis. If this is not possible, batch adjustment using outcome as a covariate should only be performed with great caution, and the batch-adjusted data should not be trusted to be "batch effect free", even when a diagnostic tool might claim so.
So instead of creating a batch-effect-free data set, one should strive to keep the data as is, with all effects present, and account for batch (and other nuisance) effects in the analysis, for instance by blocking for batch in limma.
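A minimal sketch of what "keeping batch in the model" means, in Python/numpy rather than R/limma, with made-up sample sizes and effect sizes: the group effect is estimated jointly with batch, instead of subtracting a batch estimate first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unbalanced design: 12 samples, 2 groups, 2 batches.
group = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1])
batch = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

true_group_effect = 2.0   # assumed values, for the simulation only
true_batch_effect = 5.0
y = (true_group_effect * group
     + true_batch_effect * batch
     + rng.normal(0, 0.3, 12))

# "Blocking for batch": batch is simply a column of the design matrix,
# so the group effect is estimated while accounting for batch.
X = np.column_stack([np.ones(12), group, batch])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
group_estimate = beta[1]   # close to the true group effect
```

No "clean" data set is ever produced here; the batch effect stays in the data and in the model, which is the point.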
I will try to answer your questions:
Are we tricking ourselves with batch effect correction?
Yes, if we treat the newly created data set as "batch effect free", i.e. assume this is what the data would have looked like had there been no batch effects in the first place. Batch correction tools that create a new data set first need to estimate the batch effect and then remove that estimate from the data. However, this estimate comes with an estimation error, so what happens in practice is that the original batch effect is replaced by the estimation error. You will still have batch effects (the estimation errors) in your data.
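The "replaced by the estimation error" point can be made concrete with a toy sketch (Python/numpy, all numbers assumed): subtracting an estimated batch mean leaves the corrected data offset from the hypothetical batch-free data by exactly the estimation error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
true_batch_effect = 3.0   # assumed value

# Hypothetical batch-free data, and what we actually observe in one batch
batch_free = rng.normal(0, 1, n)
observed = batch_free + true_batch_effect

# Mean-centering style adjustment: estimate the batch effect as the
# batch mean and subtract it
estimated_batch_effect = observed.mean()
corrected = observed - estimated_batch_effect

# The estimate is off by the mean of the noise ...
estimation_error = estimated_batch_effect - true_batch_effect
# ... and the corrected data differs from the batch-free data by exactly
# that error: the batch effect was replaced, not removed.
```

With small batches the estimation error can be substantial, and it never vanishes entirely.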
How can we be sure that these methods indeed eliminate true/meaningful batches and not just overfit the data?
In the project I was working on we wanted to analyse the data with a tool that could not itself correct for batch effects (as, for instance, limma can), so I had to use ComBat anyway. However, I did not include the biological groups as covariates, and when doing PCA on the batch-adjusted data set my groups seemed to cluster. This convinced me that the variation in my data was now driven more by biology than by the batch effect. Still, my data would continue to contain batch effects, and the downstream results had to be treated with caution.
With before-and-after plotting there are ample opportunities to trick yourself into thinking that the batch effect is gone and you are left with a group effect. This can happen with clustering/PCA plots, but the more dedicated tool PVCA has also been used. PVCA estimates the proportion of variance attributable to each effect in a data set. If you feed PVCA the same information (batch, group) as you used in the batch-removal step, no batch effect will be estimated: if there were one, it would also have been estimated by the removal tool and removed from the "clean" data. The remaining batch effect (the estimation error) will not be detected as a batch effect, but some of it will be identified as a group effect if your data set is very unbalanced. I found this by doing a lot of simulations with known group/batch effects.
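Here is a small null simulation in that spirit (Python/numpy; the sample sizes, the degree of imbalance, and the ComBat-like mean adjustment are my assumptions, not the exact setup from the paper). The data are pure noise, yet adjusting for batch with group as a covariate and then re-testing the groups yields far more than the nominal 5% of "significant" genes, while the same test on the untouched data behaves correctly.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes = 2000

# Severely unbalanced hypothetical design: 12 samples, 2 batches.
# Batch 0: 5x group A, 1x group B; batch 1: 1x group A, 5x group B.
group = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
batch = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
X = np.column_stack([np.ones(12), group, batch])

# Null data: no group effect and no batch effect at all, just noise.
Y = rng.normal(0, 1, (n_genes, 12))

# Batch adjustment with group as covariate: estimate intercept, group
# and batch effects jointly, then subtract only the batch term
# (a mean-shift caricature of what ComBat does).
beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)   # beta is 3 x n_genes
Y_adj = Y - beta[2][:, None] * batch[None, :]

def fp_rate(data):
    """Fraction of genes 'significant' in a naive two-sample t-test
    (two-sided 5% cutoff for df = 10 is about 2.228)."""
    A, B = data[:, group == 0], data[:, group == 1]
    nA, nB = A.shape[1], B.shape[1]
    sp2 = ((nA - 1) * A.var(1, ddof=1)
           + (nB - 1) * B.var(1, ddof=1)) / (nA + nB - 2)
    t = (B.mean(1) - A.mean(1)) / np.sqrt(sp2 * (1 / nA + 1 / nB))
    return float(np.mean(np.abs(t) > 2.228))

raw_rate = fp_rate(Y)      # close to the nominal 0.05
adj_rate = fp_rate(Y_adj)  # clearly inflated, on pure noise
```

The inflation comes from re-testing data that were already adjusted using the very group labels being tested, without accounting for that in the test.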
How can we be sure we are not tricking ourselves with these methods?
By knowing their limitations, i.e. that the "clean" data will contain batch effects as well. The more ideal solution would be not to use them at all: preferably by not having batch effects in the first place, by having a balanced design, or by accounting for batch in the statistical analysis to be performed.
Yesterday I heard that some of my former co-authors had to retract a Nature paper due to invalid statistics for the null model. OK, well, that is a tough pill to swallow, but it can happen to anyone. It is not easy to get that right. In fact, some years back when I was less experienced, not being sure whether I did an analysis right used to keep me up at night ... I am still not sure if I do it right, but I sleep better ;-)
But what was really eye-opening to me is that the paper itself lays out extremely convincing plots, visualizations and effect sizes that show strong and radical effects based on the data. Except that the data in turn was no better than random noise! How could that possibly happen? The answer is that the dimensionality of the space is huge: there is always something that seems to fall into place. I know the authors; both are well-intentioned and honest scientists. But then it also shows better than anything just how easily one can get fooled when dealing with methods one doesn't fully understand. You can get very convincing plots out of what is essentially random noise. I think this paper should be taught in school!
In a nutshell, my personal opinion is that batch effect corrections are a valid approach when done right. But I am also convinced that they do far more damage than good, as God knows how many scientists who do not understand them apply them to their data and discover wondrous new phenomena that are not actually there.
Yes, proper negative controls are key (also in computational research!). In my case the negative control using permuted batches was kind of eye-opening and triggered this question. I would still be very much interested in an answer, but maybe this is really a fundamental limitation of batch correction methods we have to live with.
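For what it is worth, the permuted-label negative control can be sketched generically (Python/numpy; the `n_hits` function is a hypothetical stand-in for whatever full pipeline one actually runs): rerun the pipeline on permuted labels and compare hit counts; real findings should stand well above the permutation distribution.

```python
import numpy as np

rng = np.random.default_rng(7)

def n_hits(data, labels):
    """Hypothetical stand-in for a full analysis pipeline: per-feature
    two-sample t-statistics, counting |t| above an arbitrary 2.0 cutoff."""
    a, b = data[:, labels == 0], data[:, labels == 1]
    t = (b.mean(1) - a.mean(1)) / np.sqrt(
        a.var(1, ddof=1) / a.shape[1] + b.var(1, ddof=1) / b.shape[1])
    return int(np.sum(np.abs(t) > 2.0))

# Pure-noise data, so the "real" labels carry no signal at all
data = rng.normal(0, 1, (1000, 20))
labels = np.array([0] * 10 + [1] * 10)

real_hits = n_hits(data, labels)
perm_hits = [n_hits(data, rng.permutation(labels)) for _ in range(100)]

# Empirical p-value of the real hit count against the permutation null;
# here it should be unremarkable, since there is nothing to find.
p_emp = (1 + sum(h >= real_hits for h in perm_hits)) / (1 + len(perm_hits))
```

The key discipline is to permute the labels *before* the whole pipeline, including any batch adjustment, so that the control exercises every step that could manufacture signal.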
But the data from this paper did not have batch effect, right?
They did not use a proper background control, which led to wrong conclusions.
However, if somebody could volunteer advice on how to look for potential batch effects in samples, that would be highly appreciated and very useful.
Yes, correct. The point I wanted to make is more about using a tool, such as a pattern matcher or, say, a batch-effect normalizer, that one does not fully understand and that could nevertheless produce a very strong signal.
While I have my own thoughts on this, I just tweeted this post to Jeff Leek. Hopefully he can write a reply, since he has obviously thought quite a bit about SVA/ComBat.