There are various methods to normalize single-cell RNA-seq data (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4848819/). Is TPM a reasonable approach for normalization of scRNA-seq data?
There is still a lot of debate in the field regarding the best way to normalize scRNA-seq data. The most popular tool right now seems to be Seurat. Its default normalization (LogNormalize) is similar in spirit to TPM, except that each cell is scaled to 10K counts instead of 1M (and, for UMI data, no gene-length correction is applied), followed by a log transform. Thus, TPM may not be the best option, but it is certainly a reasonable approach.
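The per-10K scaling described above can be sketched in a few lines of base R. This is only an illustration of the arithmetic on a made-up toy matrix, not Seurat's actual implementation (which lives in `Seurat::NormalizeData()`):

```r
# Toy counts matrix, genes x cells (values are made up for illustration)
counts <- matrix(c(10, 0, 5,
                   90, 10, 5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"),
                                 c("cell1", "cell2", "cell3")))

# LogNormalize-style scaling: counts per 10,000 per cell, then log1p
scale_factor <- 1e4
lognorm <- log1p(t(t(counts) / colSums(counts)) * scale_factor)
```

Note that every cell is forced to the same total, which is exactly the "naive per-million (here per-10K) scaling" criticized in the answer below.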
I would look into the SCnorm package (https://github.com/rhondabacher/SCnorm).
Quoting from the SCnorm paper: To evaluate the extent to which biases introduced during normalization affect the identification of DE genes, we applied MAST (FDR = 0.05) to identify DE genes between the H1-1M and H1-4M conditions. Normalization with SCnorm resulted in the identification of no DE genes, whereas MR, TPM, scran, SCDE, and BASiCS resulted in 530, 315, 684, 401, and 1147 DE genes, respectively, being identified. The majority of DE calls made using data normalized by these latter approaches are lowly expressed genes (Fig. 2(b)), which appear to be over-normalized (Fig. 2(a)). Supplementary Fig. S4 shows similar results using H9 cells.
Figure legend (Fig. 2): Fold-changes and DE genes calculated from the H1 case study data. For each gene, the fold-change of non-zero counts between the H1-4M and H1-1M groups was computed for data following normalization via SCnorm, MR, TPM, scran, SCDE, and BASiCS. Box-plots of gene-specific fold-changes are shown in panel (a) for data normalized by each method. The number of genes identified as DE using MAST is shown in panel (b). Genes are divided into four equally sized expression groups based on their median among non-zero un-normalized expression measurements, and results are shown as a function of expression group.
In my case, SCnorm actually worked better than SCDE- and TPM-normalized counts.
I hope this helps.
TPM normalization (like any naive per-million scaling) is unable to properly correct for differences in library composition. For an illustration of why naive methods perform poorly, please see this video. The same goes for naive log-scaling, and I am actually not sure why this is still commonly used in Seurat.
An alternative that actually produces normalized counts on the log scale is the size-factor method implemented in the scran package via the function calculateSumFactors() (publication: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7).
These logcounts can then be used for either feature selection and PCA or (if necessary) for batch correction approaches.
I recommend the Bioconductor workflow for details: https://osca.bioconductor.org/
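Once you have per-cell size factors (in the real workflow they come from scran's `calculateSumFactors()`; here they are invented for illustration), producing logcounts is simple arithmetic: divide each cell's counts by its size factor, add a pseudo-count, and log-transform. A base-R sketch of that step:

```r
# Toy counts matrix, genes x cells (values are made up)
counts <- matrix(c(4, 8, 2,
                   0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), c("c1", "c2", "c3")))

# Hypothetical size factors; in practice these come from
# scran::calculateSumFactors(). They are centred to average 1 so the
# logcounts stay on roughly the scale of the raw counts.
size_factors <- c(0.5, 1.0, 1.5)
size_factors <- size_factors / mean(size_factors)

# Divide each cell by its size factor, add a pseudo-count, take log2
logcounts <- log2(t(t(counts) / size_factors) + 1)
```

Unlike per-million scaling, the size factors are estimated from the data (by deconvolving pooled cells), so a few highly expressed genes do not distort the normalization of everything else.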
Another tool is SCTransform, which is now integrated into Seurat. The problem here is that SCTransform does not produce counts but returns the residuals of the model it fits. It is (to the best of my knowledge) still not clear whether these can be used for differential analysis. There are multiple issues in the Seurat GitHub repository, but afaik no official, bullet-proof statement from the developers. If this has changed by now, feel free to add a comment.
I personally use SCTransform's vst on each individual sample to obtain highly variable genes (HVGs), perform batch correction and clustering on the normalized counts from calculateSumFactors() using the aforementioned HVGs, and aggregate cells per cluster into pseudobulks for the actual differential analysis via edgeR. The latter obviously requires biological replicates.
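The pseudobulk step is just summing raw counts over all cells that share a cluster and a biological sample. A minimal base-R sketch (toy matrix and made-up labels; in a real pipeline the resulting genes-by-pseudobulk matrix would be passed to edgeR with the sample as the unit of replication):

```r
# Toy single-cell counts, genes x cells (values are made up)
counts <- matrix(1:24, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6)))

# Hypothetical per-cell labels: cluster assignment and biological sample
cluster <- c("T", "T", "B", "B", "T", "B")
sample  <- c("s1", "s1", "s1", "s2", "s2", "s2")

# One pseudobulk per (cluster, sample) pair: sum raw counts over its cells
group <- paste(cluster, sample, sep = "_")
pseudobulk <- t(rowsum(t(counts), group = group))
# 'pseudobulk' (genes x pseudobulk samples) is the input for edgeR
```

Summing raw counts (not normalized values) is the point: the pseudobulks behave like small bulk RNA-seq libraries, so edgeR's own normalization and dispersion estimation apply, and replication is at the sample level rather than treating each cell as independent.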