I found your post a little late, but I found it so interesting that a will give it a late answer. In your description of your experiment, you do not inform of the experiment design, if the batches contained the same proportion of samples from your biological groups. Based on your observations I am pretty sure it was quite unbalanced.
A couple of years ago I went trough the same experience as you. I had a severely unbalanced data set with batch effects which I tried to salvage using ComBat in what seemed like the recommended way. I thought my results was too good to be true. So I did, as you, several sanity checks with permuted labels and random numbers. The results were equally good clusterings of the "groups" and long lists differentially expressed genes. After some more reading and head-scratching I could not figure out what I did wrong, so I asked for help at the bioconductor list:
Is ComBat introducing a signal where there is none?
However, I was unable to attract any responders. Anyway, batch effects was such a common problem at my work that we just had to figure this out. With the help of experienced and more mathematically skilled colleges, we looked further into this and wrote a paper
"Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses"
Our advice is:
"For an investigator facing an unbalanced data set with batch effects, our primary advice would be to
account for batch in the statistical analysis. If this is not possible, batch adjustment using outcome as a
covariate should only be performed with great caution, and the batch-adjusted data should not be trusted
to be “batch effect free”, even when a diagnostic tool might claim so."
So instead of creating a batch effect free data set, one should strive to keep the data as is, with all effects, and include the batch (and other) effects in the analsysis, for instance blocking for batch in limma.
I will try to answer your questions
Are we tricking ourselves with batch effect correction?
Yes, if we treat the new created data set as "batch effect free", i.e. assuming this is what data would have been if we did not have batch effects in the first place. The batch correction tools which creates a new data set first need to estimate the batch effect and then remove that estimate from the data. However this estimate do have an estimation error, so what happens in practice is that the original batch effect is replaced with the estimation error. You will still have batch effects (the estimation errors) in your data.
How can we be sure that these methods indeed eliminate true/meaningful batches and not just overfit the data?
In the project I was working on we wanted to analyse the data with a tool that could not itself correct for batch-effects (like for instance limma can), therefore I had to use ComBat anyway. However I did not include the biological groups as covariates, and when doing PCA on the batch-adjusted data set, my groups seemed to cluster. This convinced me that the variation in my data now was more due to biology than to the batch effect. However, my data would still contain batch effects, and the results later on had to be treated with caution.
By doing the before and after plotting there are ample possibilities to trick your self into think that the batch effects is gone and you are left with a group effect. This can be achieved with clustring/pca plots, but also the more dedicated tool PCVA has been used. PCVA estimates the effects in a data set. If you feed PCVA with the same information (batch,group) as you did in the batch removal step, then no batch effect will be estimated, if it was, it would also have been estimated by the removal-tool and removed in the "clean" data. The remaining batch effect (the estimation error) will not be detected as a batch effect, but some of it will be identified as a group effect if your data set is very unbalanced. I found this by doing a lot of simulations with known group/batch effects.
How can we be sure we are not tricking ourselves with these methods?
By knowing their limitations, i.e the "clean data" will contain batch effects as well. The more ideal solution would be not to use them, preferably by not having batch effects in the first place, or having a balanced design, or account for batch in the statistical analysis to be performed.