Question

Should I perform integration to correct batch effects?


3 months ago

han ▴ 30

I have two single-cell RNA-seq samples: one from a liver metastasis of a wild-type tumor cell line, and the other from a liver metastasis of a drug-resistant tumor cell line. I want to compare the tumor regions between these two conditions. The experimental and sequencing workflows for both samples were completely identical (the only variable was time).

When performing dimensionality reduction and clustering, should I apply integration to remove batch effects? I'm concerned that integration might also eliminate some true biological differences between the two tumor types.

Any suggestions or best practices would be greatly appreciated!

batch effect • 814 views

updated 12 weeks ago by antonioggsousa 3.4k • written 3 months ago by han ▴ 30


Actually, I tried both approaches, with and without batch effect correction, and the results were completely different. In the UMAP without batch effect correction, WT and RT are completely separated, whereas after batch correction they partially overlap.

3 months ago by han ▴ 30

score 7 · Accepted Answer · 2025-06-18

Hi,

Please see my previous answer to a similar question on this forum: "When should I NOT apply batch correction for my single-cell RNAseq data?"

I also recommend the following bioRxiv paper, which compares integration methods for scRNA-seq cancer samples: "A comparison of data integration methods for single-cell RNA sequencing of cancer samples".

Regarding your specific question, the answer depends on your aims. In my view, integration is about identifying the cell populations shared across datasets, whether or not a batch effect is present.

In your example, the two samples represent different biological conditions (WT versus drug-resistant cancer cell lines), not different technical batches. If you expect to find cell populations shared across the two conditions, I would perform integration; otherwise, I would not.

There are generally three main approaches, which can be combined, to check whether the data require integration:

Dimensional reduction techniques
Automatic cell annotation (this might not apply in your case)
Independent sample analysis: clustering, cluster markers, annotation (cluster comparison between samples)
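
As a rough illustration of the cluster comparison between samples, one option is to score the overlap between the marker gene lists of every pair of clusters. The sketch below is only an assumption about how this could be done, not the method used in the course materials: it uses the Jaccard index, and the function names, cluster labels and gene lists are hypothetical toy values rather than real results.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard index of two collections of gene names (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def marker_overlap(markers_a, markers_b):
    """Jaccard index for every pair of clusters from two samples.

    markers_a / markers_b map a cluster label to its marker genes,
    e.g. the top-N genes from a per-sample marker test.
    """
    return {
        (ca, cb): jaccard(markers_a[ca], markers_b[cb])
        for ca, cb in product(markers_a, markers_b)
    }


# Hypothetical toy marker lists, for illustration only (not real results).
wt = {"0": ["ALB", "APOA1", "TTR"], "1": ["CD3E", "CD3D", "IL7R"]}
resistant = {"0": ["CD3E", "CD3D", "CCR7"], "1": ["EPCAM", "KRT8", "KRT18"]}

overlap = marker_overlap(wt, resistant)

# List cluster pairs from highest to lowest marker overlap.
for (ca, cb), score in sorted(overlap.items(), key=lambda kv: -kv[1]):
    print(f"WT cluster {ca} vs resistant cluster {cb}: Jaccard = {score:.2f}")
```

Cluster pairs with a high score are candidates for shared populations; what counts as "high" depends on how many markers you keep per cluster, so treat any threshold as a judgment call.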

You can check the following course materials to see in practice how this can be done for a few examples (I should disclose that I am the author of these materials): The Hitchhiker’s Guide to scRNA-seq course.

For example, check the following vignette: Cross-tissue integration task.

Regarding the difference between the UMAPs with and without integration, you can check whether the clusters that appear shared also share a good number of marker genes, so that you can confidently say these clusters are shared between the biological conditions. Ideally, the clusters identified in each individual sample should map one-to-one onto the integrated clusters. You can assess this by building a confusion matrix that compares the clustering results obtained with and without integration.
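
A minimal sketch of such a confusion matrix, assuming you have exported one cluster label per cell from each analysis, in the same cell order. The label vectors and function names below are hypothetical and only show which per-sample cluster lands in which integrated cluster.

```python
from collections import Counter


def confusion_matrix(labels_a, labels_b):
    """Count cells for each (label_a, label_b) pair.

    Both inputs hold one label per cell, in the same cell order.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    return Counter(zip(labels_a, labels_b))


def best_match(counts):
    """For each cluster in the first labelling, return the cluster of the
    second labelling that receives most of its cells, and that fraction."""
    totals = Counter()
    for (a, _), n in counts.items():
        totals[a] += n
    result = {}
    for (a, b), n in counts.items():
        frac = n / totals[a]
        if a not in result or frac > result[a][1]:
            result[a] = (b, frac)
    return result


# Hypothetical toy labels, for illustration only (not real results).
per_sample = ["wt_0", "wt_0", "wt_1", "wt_1", "res_0", "res_0", "res_1", "res_1"]
integrated = ["A", "A", "B", "B", "B", "B", "C", "A"]

counts = confusion_matrix(per_sample, integrated)
matches = best_match(counts)

for cluster, (target, frac) in sorted(matches.items()):
    print(f"{cluster} -> integrated cluster {target} ({frac:.0%} of its cells)")
```

A per-sample cluster whose cells are spread over several integrated clusters (a low best fraction) suggests that integration is splitting or merging populations, which is worth checking against the marker genes.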

I hope this helps.

Best,

António