Hey, I have tried Harmony and CCA for batch effect correction of my single-cell RNA-seq data, in order to compare the differences between tumor and normal tissues. However, when I integrated all the samples with either Harmony or CCA, the results showed over-correction between tumor and normal tissues: for example, exhausted T cells, which were present only in tumors, appeared in normal tissues after batch effect correction. How can I solve this problem? Can I solve it by modifying some of the parameters of the RunHarmony or FindIntegrationAnchors functions, or with some other function?
In my experience this is a common problem. CCA and similar algorithms force cells from different datasets to cluster close together. In your case, the tumor-versus-normal differences are probably being treated as a batch effect and therefore removed. Please give some details. Do you have replicates per tumor/normal condition, so that you can check whether you really have a batch effect, in the sense of unwanted technical variation, that is worth correcting?
Thank you for your answer. For example, when I analyzed the T cells in the tumor tissues, I found a group of CD8+ T cells that highly expressed exhausted T cell markers such as HAVCR2, ENTPD1 and LAG3. I termed this group exhausted T cells, which are common in the tumor microenvironment. After batch correction by CCA or Harmony across all samples, however, this group of cells also appeared in normal tissues, even though the cells of this cluster from normal tissues did not express exhaustion markers such as HAVCR2 or ENTPD1 at all. Based on prior knowledge, I know that this cluster of T cells is not shared by tumor and normal tissues, so it was over-corrected by these algorithms. Can I fix this by modifying some of the parameters in Seurat or Harmony to prevent the over-correction?
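The check described here (do the cells of one integrated cluster express the exhaustion markers in both conditions?) can be sketched roughly as below. This is only an illustration in plain Python on made-up counts, not Seurat or Harmony code; the function name `marker_positive_fraction` and the cell records are hypothetical.

```python
from collections import defaultdict

def marker_positive_fraction(cells, markers):
    """Return {condition: {marker: fraction of cells with count > 0}}."""
    totals = defaultdict(int)
    positives = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        cond = cell["condition"]
        totals[cond] += 1
        for m in markers:
            if cell["counts"].get(m, 0) > 0:
                positives[cond][m] += 1
    return {
        cond: {m: positives[cond][m] / totals[cond] for m in markers}
        for cond in totals
    }

# Toy cells that all landed in the same cluster after integration (made-up values).
cluster_cells = [
    {"condition": "tumor", "counts": {"HAVCR2": 5, "ENTPD1": 2, "LAG3": 3}},
    {"condition": "tumor", "counts": {"HAVCR2": 1, "ENTPD1": 0, "LAG3": 4}},
    {"condition": "normal", "counts": {"HAVCR2": 0, "ENTPD1": 0, "LAG3": 0}},
    {"condition": "normal", "counts": {"HAVCR2": 0, "ENTPD1": 0, "LAG3": 1}},
]

fractions = marker_positive_fraction(cluster_cells, ["HAVCR2", "ENTPD1", "LAG3"])
for cond, per_marker in fractions.items():
    print(cond, per_marker)
```

If the marker-positive fraction is high in tumor cells but close to zero in the normal cells of the same cluster, that is consistent with the over-correction described above rather than with a genuinely shared population.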
You are repeating what you said in your question. I understand the underlying problem and agree that this is often the problem with integration procedures: the actual variation of interest between datasets is actively removed.
My question was:
In other words, did you check if you can go without integration?
Sorry for misunderstanding your question. I have tried running the analysis without integration on 4 pairs of samples, and the cluster of exhausted T cells was not present in normal tissues. However, when I analyzed other cell types, such as fibroblasts or neutrophils, the batch effect was obvious: some cell clusters were dominated by a single sample.
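One rough way to quantify "dominated by a single sample" is to tabulate, per cluster, the share of cells contributed by each sample. The sketch below uses plain Python on toy labels. The function name `dominated_clusters`, the 0.8 threshold and the cluster and sample names are all assumptions for illustration, not part of Seurat or Harmony.

```python
from collections import Counter, defaultdict

def dominated_clusters(labels, threshold=0.8):
    """labels: iterable of (cluster, sample) pairs, one per cell.

    Returns {cluster: (top_sample, fraction)} for every cluster in which a
    single sample contributes at least `threshold` of the cells.
    """
    per_cluster = defaultdict(Counter)
    for cluster, sample in labels:
        per_cluster[cluster][sample] += 1

    flagged = {}
    for cluster, counts in per_cluster.items():
        sample, n = counts.most_common(1)[0]
        frac = n / sum(counts.values())
        if frac >= threshold:
            flagged[cluster] = (sample, frac)
    return flagged

# Toy labels (hypothetical): fibro_0 is mostly from one sample, fibro_1 is mixed.
cells = (
    [("fibro_0", "P1_tumor")] * 9 + [("fibro_0", "P2_tumor")] * 1
    + [("fibro_1", "P1_tumor")] * 5 + [("fibro_1", "P2_tumor")] * 5
)

flagged = dominated_clusters(cells)
print(flagged)
```

A flagged cluster is only a hint: a sample that contributes most of the cells overall will dominate many clusters without any batch effect, so the per-cluster shares are best read against each sample's share of the whole dataset.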
Unfortunately, I cannot directly help with your question. However, I am currently working on a new method that tries to overcome this problem. May I ask whether the data you are using are published or accessible somewhere, so that I could test my own batch-correction method on them?