Hello!
I'm fairly new to scRNA-seq and bioinformatics in general, and am just starting to appreciate the nuances that go into the analysis (in this case, batch effect correction). While looking into how to deal with batch effects, I've found many packages that correct them when integrating two or more separate datasets into one, such as the following:
- Seurat's integration pipeline (https://satijalab.org/seurat/articles/integration_introduction.html), which uses CCA and mutual nearest neighbors (MNN) to identify shared cell types and "draw the datasets together",
- or Harmony (https://github.com/immunogenomics/harmony), which instead uses PCA for dimensionality reduction and models each cell with batch-independent and batch-dependent terms (the latter of which are removed).
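As a rough illustration of the mutual nearest neighbors idea mentioned above, here is a minimal sketch in plain Python. It is not Seurat's or batchelor's implementation: the function names, the toy 2-D coordinates, and the single global shift estimate are assumptions made for the example. Real tools work in a reduced-dimension space (CCA or PCA) and apply locally smoothed corrections.

```python
import math

def nearest(query, reference, k):
    """For each query cell, return the indices of its k nearest reference cells."""
    hits = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: math.dist(q, reference[j]))
        hits.append(set(order[:k]))
    return hits

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Pairs (i, j) where cell i of A and cell j of B are in each other's k nearest."""
    a_to_b = nearest(batch_a, batch_b, k)
    b_to_a = nearest(batch_b, batch_a, k)
    return [(i, j)
            for i in range(len(batch_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]

# Hypothetical toy data: batch B looks like batch A shifted by roughly (1, 1).
batch_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
batch_b = [(1.0, 1.0), (6.0, 6.0), (6.2, 5.9)]

pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
print(pairs)  # [(1, 0), (2, 1)]

# Crude global estimate of the batch shift: mean difference across MNN pairs.
n_dims = len(batch_a[0])
shift = [sum(batch_b[j][d] - batch_a[i][d] for i, j in pairs) / len(pairs)
         for d in range(n_dims)]
print(shift)
```

Only cells that pick each other as nearest neighbors form a pair, which is why cell 0 of batch A is left out even though its own nearest neighbor is cell 0 of batch B.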
However, I have not found much information on how to deal with batch effects between samples/patients within ONE dataset/study. From what I understand, how to address this is still an open question in the field, and it seems to be something you try to minimize when designing your experiment rather than something you actively correct for after the fact. I am using publicly available datasets with no access to the raw data, though, and I am fairly sure there are batch effects between the samples within the individual datasets I am using.
So my question is: without access to the raw data, what are some ways I can detect and alleviate batch effects between the samples of ONE dataset/study?
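On the detection side, one simple check can be sketched as follows: for each cell, look at its k nearest neighbors in a low-dimensional embedding and ask how often they come from the same sample, then compare that to what random mixing would give. This is a hedged, plain-Python sketch. The function names and the toy coordinates are made up for illustration, and in practice the embedding would come from PCA on normalized data.

```python
import math
from collections import Counter

def same_batch_fraction(cells, batches, k):
    """Mean fraction of each cell's k nearest neighbours sharing its batch label."""
    fractions = []
    for i, cell in enumerate(cells):
        others = [j for j in range(len(cells)) if j != i]
        others.sort(key=lambda j: math.dist(cell, cells[j]))
        same = sum(batches[j] == batches[i] for j in others[:k])
        fractions.append(same / k)
    return sum(fractions) / len(fractions)

def expected_fraction(batches):
    """Same-batch fraction expected if batch labels were randomly mixed."""
    n = len(batches)
    counts = Counter(batches)
    return sum(c * (c - 1) / (n - 1) for c in counts.values()) / n

# Hypothetical embedding: two tight groups of four cells each.
cells = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]

separated = ["s1"] * 4 + ["s2"] * 4   # each group is a single sample
mixed = ["s1", "s2"] * 4              # samples interleaved within each group

separated_score = same_batch_fraction(cells, separated, k=3)
mixed_score = same_batch_fraction(cells, mixed, k=3)
baseline = expected_fraction(separated)

print(separated_score, mixed_score, baseline)
```

A score well above the baseline means cells group by sample. That is only a hint, not proof of a technical batch effect, because samples from different conditions or with different cell-type compositions will also separate for biological reasons.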
The closest answers I have seen are:
- SCTransform (https://github.com/ChristophH/sctransform), which tries to separate gene expression from non-biological technical variables (such as sequencing depth),
- and batchelor's (https://github.com/LTLA/batchelor/blob/master/DESCRIPTION) fastMNN() function (though, while it can take a single object as its input, it still seems more appropriate for multiple datasets).
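To make the "separate expression from sequencing depth" idea more concrete, here is a simplified sketch. It is not what sctransform actually does (sctransform fits a regularized negative binomial regression per gene with sequencing depth as a covariate). Instead it assumes a plain Poisson model, where the expected count is the cell's depth times the gene's overall share of counts, and computes Pearson residuals from that. The gene names and counts are invented for the example.

```python
import math

def pearson_residuals(counts):
    """counts: list of cells, each a list of per-gene UMI counts.

    Assumes every cell and every gene has at least one count,
    otherwise the expected value mu would be zero.
    """
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    residuals = []
    for row, depth in zip(counts, cell_totals):
        res_row = []
        for x, gene_total in zip(row, gene_totals):
            mu = depth * gene_total / grand_total  # expected count under Poisson
            res_row.append((x - mu) / math.sqrt(mu))
        residuals.append(res_row)
    return residuals

def residual_variance(residuals):
    """Sample variance of the residuals for each gene (column)."""
    variances = []
    for col in zip(*residuals):
        mean = sum(col) / len(col)
        variances.append(sum((r - mean) ** 2 for r in col) / (len(col) - 1))
    return variances

# Hypothetical toy matrix: 4 cells x 3 genes.
genes = ["housekeeping_a", "housekeeping_b", "variable"]
counts = [
    [10, 20, 0],
    [20, 40, 30],
    [10, 20, 30],
    [20, 40, 0],
]

variances = residual_variance(pearson_residuals(counts))
ranking = sorted(genes, key=lambda g: variances[genes.index(g)], reverse=True)
print(ranking[0])  # variable
```

Genes whose counts mostly follow sequencing depth get small residuals, while genes that vary independently of depth rise to the top of the ranking, which is what makes residual variance usable for feature selection.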
TL;DR: are there more packages (in R or Python) or strategies (such as quality control or pre-filtering methods) that you guys know of that can help alleviate the batch effects present within just one dataset (no merging, no integration, just looking at ONE study and its data)?
Thank you for reading!
Before going deeper, can you give more details on the data you have? You say "intra-sample", but does that really mean effects within the same set of cells that were produced during the exact same 10X run as a single sample? Or are you talking about samples of the same study that might have been produced on different days, or with different perturbations or conditions?
For the latter, I like fastMNN from batchelor very much. It is efficient, fast, and has no large memory footprint (contrary to what a recent paper states), and it gives decent results. Seurat and Harmony perform the same kind of correction.
SCTransform is more a transformation method than a batch correction. In fact, on its own it is not suitable for integrating datasets, but it can be used to rank genes by residual variance (with respect to the model it fits), which can then be used for feature selection or as a sort of "normalized count". The latter use is not so well established, I think.
Hello! Thank you for your answer!
When I say "intra-sample", I mean the latter: samples of the same study produced on different days, under different conditions, etc. I was also wondering: can you "integrate" samples within the same study (but with different conditions and such) using Seurat, Harmony, or other similar packages, or is "integration" something you only do for different studies? If that makes sense?