Question

When should I NOT apply batch correction for my single-cell RNAseq data?

0

11 months ago

e.r.zakiev ▴ 230

Hello,

I have found myself scratching my head after applying Seurat's current (v4) FindIntegrationAnchors and IntegrateData pipeline, which relies on CCA, to my scRNA-seq dataset. I am wondering whether I am eliminating biologically relevant signal as well.

In my data, there are 16 phenotypically distinct samples which were prepared in the wet lab on different days (we are talking about a timepoint progression of ~14 days between the initial timepoint and the last timepoint). There is one initial seeded batch of cells, which was split into three parts, each subjected to one of:

no treatment
treatment A
treatment B

For all three branches we collected cells at several progression timepoints (day 3, day 6, day 9, day 12 and day 14), giving us in total

5 samples for the non-treated condition
5 samples subjected to the treatment A
5 samples subjected to the treatment B

plus the initial batch at day 0.

The samples after collection were fixed on the respective days.

The fixed samples were then encapsulated, had libraries prepared, and were sequenced in one go, in one batch, hence there is no sequencing batch effect.

But there is most certainly a batch effect associated with days of collection of samples.

When I apply Seurat's integration and correction, the samples form a single blob of cells on all embeddings (left column for corrected embeddings and the right column for the uncorrected):

[Figure: embeddings of all samples; left column shows the corrected embeddings, right column the uncorrected ones]

Two questions:

Is our experiment utterly borked, or is Seurat too zealous in its batch correction?
Should I even apply the batch correction given the fact that these aren't the same cells that come from different batches, but (supposedly) phenotypically different cells?

I normally would be a proponent of ALWAYS applying batch effect correction in the single-cell RNAseq setting, just in case. But here - am I doing it right?

scRNA-seq batch-correction • 2.6k views

ADD COMMENT • link updated 11 months ago by Ram 44k • written 11 months ago by e.r.zakiev ▴ 230

Ram · Accepted Answer · 2023-10-05

9

11 months ago

antonioggsousa 3.2k

Hi!

Some personal thoughts/opinions: integration is about finding similar cell types/states across data sets (with or without a batch effect). An experimental batch corresponds to a set of samples that were processed simultaneously in the same manner, which reduces the effect of technical artifacts/noise within that set.

As I see your experiment, you have two biological conditions plus a control, and you want to study the effect of two treatments on the same batch of cells over time (a longitudinal experiment).

Thus I don't agree with:

there is most certainly a batch effect associated with days of collection of samples.

Unless samples from the different conditions were collected on different days for the same timepoint (e.g., because the conditions were started at different times), I don't see how collection at distinct timepoints could be considered a batch or introduce noise. If there is any such noise, it should affect the different conditions equally (this is my opinion).

From your plots above you can clearly see that perhaps one of the biggest differences is between the sample NT-D0 (which I guess refers to the original batch of cells) and the remaining cells/conditions. In the uncorrected/unintegrated projections, you can see that cells separate slightly by condition, as I guess you would expect. I have trouble identifying the control cells, as they seem to be quite spread out.
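To make the confounding argument concrete, here is a minimal sketch that rebuilds the design table from the description in the question and checks two things: whether any collection day carries more than one timepoint, and which days contain all three branches. The labels `NT`, `A` and `B` and the field names are my own placeholders, not identifiers from the original data.

```python
from collections import defaultdict

# Design as described in the question: one day-0 sample plus
# three branches sampled on days 3, 6, 9, 12 and 14.
conditions = ["NT", "A", "B"]  # hypothetical labels for the three branches
timepoints = [3, 6, 9, 12, 14]

samples = [{"condition": "NT", "timepoint": 0, "collection_day": 0}]
for cond in conditions:
    for day in timepoints:
        # Each sample was fixed on the day it was collected, so the
        # collection day and the biological timepoint are the same number.
        samples.append({"condition": cond, "timepoint": day, "collection_day": day})


def levels_per_group(rows, group_key, value_key):
    """Map each value of group_key to the set of value_key levels seen with it."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_key]].add(row[value_key])
    return dict(seen)


timepoints_per_day = levels_per_group(samples, "collection_day", "timepoint")
conditions_per_day = levels_per_group(samples, "collection_day", "condition")

# Day and timepoint are fully confounded if every collection day
# carries exactly one timepoint.
day_confounded_with_time = all(len(tp) == 1 for tp in timepoints_per_day.values())

# Collection days on which all three branches were sampled together.
days_with_all_conditions = sorted(
    day for day, conds in conditions_per_day.items() if conds == set(conditions)
)

print(len(samples))
print(day_confounded_with_time)
print(days_with_all_conditions)
```

If the design is as described, a "day" label cannot be told apart from the timepoint itself, so correcting for day would also remove the time progression. On the other hand, every day after day 0 contains all three branches, so anything specific to a day is shared by the treated and untreated samples collected on it.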

All this said, let me answer your questions:

1. Is our experiment utterly borked, or is Seurat too zealous in its batch correction?

I don't think so. The CCA method in Seurat tends to prioritize batch correction over bio-conservation (see Luecken et al., 2022), which might be recommended for integration tasks where the batch effect is stronger or the data sets are more difficult to integrate, such as cross-species integration.

Seurat documentation provides an alternative method based on RPCA (Reciprocal PCA) which prioritizes bio-conservation over batch correction (see documentation).

If you think CCA is being too "aggressive" and over-integrating, you might try the RPCA method.

2. Should I even apply the batch correction given the fact that these aren't the same cells that come from different batches, but (supposedly) phenotypically different cells?

It depends on your goals. If you want to find the "shared cell types/states" across conditions, you might want to integrate. In this case, I think you know the identity of the cells in your batch and, therefore, you might not want to integrate, but instead see how the cells change with treatment over time. If the latter, I personally wouldn't integrate the data, as this would mask the biological differences between treatments.

The only case where I see integration being beneficial is if you know that your batch of cells contains multiple distinct cell types, as in PBMCs, and you would like to integrate them to see how the treatments affect each cell type.

Discuss some of these points with your collaborators and look for other answers in the forum before making an informed decision.

I hope this helps!

Best,
António

ADD COMMENT • link updated 11 months ago by Ram 44k • written 11 months ago by antonioggsousa 3.2k

3

great analysis and greatly explained, thank you!!

It was always my understanding that whenever we have samples that were not prepared under the exact same conditions (on different days of preparation, for example), we should try to account for that in the data. Am I bamboozling myself on that?

ADD REPLY • link 11 months ago by e.r.zakiev ▴ 230

2

As pointed out in this thoughtful and complete answer, the key question is the ratio of "biological effect" to "technical effect". Something that's not quite clear from your description is: how heterogeneous are your "cells"? One of the reasons that integration works is that a diverse population of cells constitutes a whole tissue, providing a significant amount of "biological variation" to exploit when mapping axes of variation onto one another.

If, as I suspect is the case here, the initial population of cells is homogeneous (i.e., a single cell line), then there is limited "ground truth biological variation" to use as an anchor. As such, I agree with the original response: I would not run a batch-effect-correcting integration, but in my analyses I would certainly include the batch as a covariate.
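As a toy illustration of what "batch as a covariate" buys you, the sketch below compares a treated and an untreated branch within each day instead of across days. All numbers are invented, the model is purely additive, and `paired_effect` is a hypothetical helper rather than a function from any package. For a balanced design with no interaction, I would expect this within-day average to match the treatment coefficient of a linear model that includes day as a categorical covariate.

```python
from statistics import mean

days = [3, 6, 9, 12, 14]

# Invented numbers for one gene (not taken from the thread).
baseline = 2.0
treatment_effect = {"NT": 0.0, "A": 1.0}
day_shift = {3: 0.0, 6: 0.5, 9: -0.3, 12: 0.8, 14: 0.2}

# Mean expression per (condition, day) sample under a purely additive model.
expr = {
    (cond, day): baseline + treatment_effect[cond] + day_shift[day]
    for cond in treatment_effect
    for day in days
}


def paired_effect(expr, treated, control, days):
    """Average the treated-minus-control difference within each day."""
    return mean(expr[(treated, d)] - expr[(control, d)] for d in days)


# Comparing across days mixes the treatment effect with the day shift.
naive = expr[("A", 12)] - expr[("NT", 3)]

# Comparing within days cancels whatever is shared by that day's samples.
adjusted = paired_effect(expr, "A", "NT", days)

print(round(naive, 3), round(adjusted, 3))
```

Note that `day_shift` here stands for anything shared by the samples of a given day, technical or biological. The within-day comparison removes it from the treatment estimate without needing to decide which of the two it is.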

ADD REPLY • link 11 months ago by LChart 4.2k

0

Good question. Yes, the cells initially come from a single cell line. During the process of pluripotency induction, which is the case here, I would assume that they go through some de-differentiation states that are common across the timepoints, generating shared cell states but in varying proportions.

ADD REPLY • link 11 months ago by e.r.zakiev ▴ 230

2

Regarding your question about SCTransform, I have to say that I'm not that familiar with its performance in practice. I know the method and what it aims to do, i.e., better stabilize the variance of genes, particularly lowly expressed ones, but I have not used it enough to have a personal opinion about its performance. That said, I know people who have been using it and are happy with it.

Recently, Ahlmann-Eltze & Huber, 2023 showed that log-normalization, as implemented in Seurat's NormalizeData() function (with defaults) as well as in other tools, e.g., scanpy, works as well as or better than other transformations, e.g., SCTransform. I usually stick with the log-normalization transformation, as it is easier for me to understand and compute.
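For reference, this is roughly what log-normalization does to a single cell, written in plain Python. It is a sketch of the transformation as I understand the Seurat defaults (scale each cell to 10,000 total counts, then take the natural log of one plus the value), not a copy of either package's code; `log_normalize` and the toy counts are made up for illustration.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


# Toy cell with four genes and 100 counts in total.
cell = [0, 1, 3, 96]
normalized = log_normalize(cell)

print([round(v, 3) for v in normalized])
```

Zeros stay at zero and the ordering of genes within a cell is preserved, which is part of why the result is easy to reason about.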

I guess the choice might depend also on the use case you might have.

Best,

António

ADD REPLY • link 11 months ago by antonioggsousa 3.2k

0

Nice pointer for the paper!!

From personal experience I would agree. I also dabbled in SCTransform when it was all-new and all-the-rage, especially since it was portrayed as the new, cutting-edge, once-and-for-all solution, but then there were papers (I couldn't point to them right now) showing that it's not much more efficient than simple log-normalisation. Log-counts are also much more transparent, so... I also stick to them now.

ADD REPLY • link 11 months ago by e.r.zakiev ▴ 230

1

and by the way, what about the SCTransform integration pipeline from Seurat, do we know how bio-conservational it is as compared to CCA and rPCA?

ADD REPLY • link 11 months ago by e.r.zakiev ▴ 230