Batch correction in scRNA-seq data via fastMNN | pro/contra
13 months ago
ATpoint 50k

scRNA-seq novice here: We have four 10X scRNA-seq samples, wildtype (WT) and knockout (KO), with n=2 per condition. Each pair (one WT and one KO) was produced on the same day, with the same FACS sorting machine, same lab, same technician etc., to avoid batch effects as much as we could.

For comparative analysis between the conditions I went through the scran / OSCA workflow and now aim to integrate the datasets. Essentially the choice is now either to merge the datasets without explicit batch correction (only doing per-sample depth correction via multiBatchNorm to ensure equal depth across the already normalized samples) or to apply fastMNN. I tested and visualized both approaches for each replicate pair independently, see below, and the results are quite different.

Both replicates (if no fastMNN is applied) show a reproducible separation by condition (which we expect), so probably the influence of condition is greater than any batch effect. When applying fastMNN the two conditions lose this separation.

Therefore my question: Are there situations where batch correction masks interesting biological features? Given that we see reproducible separation by condition, could it be more meaningful to not apply fastMNN? If I combine the datasets and only correct for batch = day (so rep1 is one batch and rep2 is one batch) I manage to preserve the separation by condition. The tSNEs then pretty much look like the left panel in the plot below.

fastMNN scRNA-seq scran batchelor • 905 views

Are there situations where batch correction masks interesting biological features?

Absolutely.

Can you explain the fastMNN application in your case a bit more? Did you run it on all four samples or separately on the pairs? What does the UMAP for all four samples look like without any batch correction?

• If I combine all four samples and only apply multiBatchNorm, I get the upper plot.
• If I run fastMNN on all four samples, specifying the day of library prep as the batch, I get the plot at the bottom.
• If I run fastMNN with default parameters (so every sample is an independent batch), I get the same picture as the two right-hand plots in the top-level post, so all clustering by condition is removed.

To me it appears that the second approach is probably the most meaningful one as it removes the modest batch effect induced by the different library prep. days while leaving the differences in condition untouched (which is what I am interested in).
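
Why the choice of batch matters here can be illustrated with a toy sketch. This is not fastMNN: it swaps in plain per-batch mean-centering as a much simpler stand-in, and every number, sample name, and helper function below is hypothetical. It only shows that declaring each sample its own batch tells the correction to treat the WT/KO difference as something to remove, while declaring the day as the batch leaves that difference in place.

```python
from statistics import mean

# Hypothetical one-gene toy data (all values made up for illustration):
# KO raises expression by 2.0, library prep on day 2 raises it by 0.5.
samples = {
    "WT_rep1": {"condition": "WT", "day": 1, "values": [1.0, 1.2, 0.8]},
    "KO_rep1": {"condition": "KO", "day": 1, "values": [3.0, 3.2, 2.8]},
    "WT_rep2": {"condition": "WT", "day": 2, "values": [1.5, 1.7, 1.3]},
    "KO_rep2": {"condition": "KO", "day": 2, "values": [3.5, 3.7, 3.3]},
}


def center_by_batch(samples, batch_key):
    """Subtract each batch's mean from its cells.

    batch_key(name, sample) decides which batch a sample belongs to.
    This is simple mean-centering, NOT the MNN algorithm.
    """
    pooled = {}
    for name, s in samples.items():
        pooled.setdefault(batch_key(name, s), []).extend(s["values"])
    batch_means = {b: mean(v) for b, v in pooled.items()}
    return {
        name: [x - batch_means[batch_key(name, s)] for x in s["values"]]
        for name, s in samples.items()
    }


def condition_gap(corrected, samples):
    """Mean KO value minus mean WT value after correction."""
    ko = [x for n, v in corrected.items() if samples[n]["condition"] == "KO" for x in v]
    wt = [x for n, v in corrected.items() if samples[n]["condition"] == "WT" for x in v]
    return mean(ko) - mean(wt)


# Every sample is its own batch (analogous to the default in the third bullet).
per_sample = center_by_batch(samples, lambda name, s: name)
# Day of library prep is the batch (analogous to the second bullet).
per_day = center_by_batch(samples, lambda name, s: s["day"])

gap_per_sample = condition_gap(per_sample, samples)
gap_per_day = condition_gap(per_day, samples)

print("KO - WT gap, batch = sample:", round(gap_per_sample, 6))
print("KO - WT gap, batch = day:   ", round(gap_per_day, 6))
```

With per-sample batches the KO/WT gap collapses to roughly zero, whereas with per-day batches the gap of 2.0 survives and the two WT replicates end up with the same mean. The real method corrects locally between matched cell populations rather than shifting whole batches, so this is only an intuition aid.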

To me it appears that the second approach is probably the most meaningful one as it removes the modest batch effect induced by the different library prep. days while leaving the differences in condition untouched (which is what I am interested in).

I agree. This is akin to integrating just the WT and just the KO, i.e. correcting for the technical influence of the day. Generally it seems like you really did a pretty good job in keeping the batch effect fairly low given how close the cells of the individual conditions track each other even without batch correction.

So it is evident from your merged-data analysis that batch correction is needed for integration. However, I feel that if you perform batch correction by day of prep, you are merging the two conditions into one object, and the replicates are then corrected with rep1 as the reference. To me that seems somewhat biased. I would have preferred batch correction by individual experiment.

Could you just analyze the two experiments separately, perform clustering and get markers? Then, when you perform batch correction (with all individual samples as separate batches), check whether clusters with similar markers overlap.

I hope you have already taken care of this, but the order of the sce objects matters for fastMNN when you supply them as a list, so they should be: WT1, WT2, KO1 and KO2 (WT and KO are interchangeable, obviously). If you do these things, I hope you post your analysis; I am interested in seeing the results.

13 months ago
James Ashmore ★ 3.1k

I would apply the mutual nearest neighbours correction for the exact reasons laid out by the developer here: https://osca.bioconductor.org/multi-sample-comparisons.html#sacrificing-differences

Thanks James for linking this section of OSCA. I had completely missed it so far, and it addresses exactly my concerns about the loss of biological signal, plus a follow-up question on how to perform DE on the integrated values. The section you link refers to another chapter where it is clearly stated that DE should not be performed on the integrated data but on the uncorrected values, using batch as a blocking factor.
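
The idea of "uncorrected values with batch as a blocking factor" can be sketched with a toy example. Everything here is hypothetical (the numbers, the deliberately unbalanced cell counts, and the helper names), and it computes only a difference in means rather than a proper statistical test. The point is that the KO vs WT comparison is made within each batch and then combined, so the day effect cancels without touching the expression values.

```python
from statistics import mean

# Hypothetical uncorrected values for one gene, grouped by day (batch).
# KO adds 2.0, day 2 adds 0.5; cell counts are deliberately unbalanced.
uncorrected = {
    1: {"WT": [1.0, 1.2, 0.8], "KO": [3.0]},
    2: {"WT": [1.5], "KO": [3.5, 3.7, 3.3]},
}


def pooled_difference(data):
    """KO minus WT after pooling all cells, ignoring batch."""
    ko = [x for groups in data.values() for x in groups["KO"]]
    wt = [x for groups in data.values() for x in groups["WT"]]
    return mean(ko) - mean(wt)


def blocked_difference(data):
    """KO minus WT computed within each batch, then averaged across batches."""
    per_batch = [mean(groups["KO"]) - mean(groups["WT"]) for groups in data.values()]
    return mean(per_batch)


pooled = pooled_difference(uncorrected)
blocked = blocked_difference(uncorrected)

print("pooled KO - WT: ", round(pooled, 6))
print("blocked KO - WT:", round(blocked, 6))
```

In this toy setup the pooled estimate (2.25) is inflated because most KO cells come from day 2, while the blocked estimate recovers the 2.0 condition effect. In the actual R workflow the blocking would be handled by the DE method itself, for example through a blocking argument or a batch term in the design.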