Question: normalization in single cell RNAseq
Asked 19 months ago by kanwarjag (United States):

There are various methods to normalize single-cell RNA-seq data (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4848819/). Is TPM a reasonable approach for normalizing scRNA-seq data?

Answer (3 votes), written 19 months ago by igor (United States):

There is still a lot of debate in the field regarding the best way to normalize scRNA-seq data. The most popular tool right now seems to be Seurat. Its default normalization is essentially TPM-style per-cell scaling, except that counts are scaled to 10,000 instead of 1 million (followed by a log transform). Thus, TPM may not be the best option, but it is certainly a reasonable approach.
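For reference, a minimal sketch of that default in Seurat (assuming a raw genes-by-cells count matrix called counts_mat; the scale factor of 10,000 is Seurat's documented default):

    library(Seurat)

    # counts_mat: hypothetical genes x cells matrix of raw counts
    seu <- CreateSeuratObject(counts = counts_mat)

    # Seurat's default "LogNormalize": scale each cell to 10,000 counts,
    # then log1p-transform
    seu <- NormalizeData(seu, normalization.method = "LogNormalize",
                         scale.factor = 10000)

    # The same scaling done by hand: per-10K counts plus log1p
    lognorm <- log1p(sweep(counts_mat, 2, colSums(counts_mat), "/") * 1e4)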


I completely agree, and there is also the limitation that these methods need fewer than 80% zeros; denoising tools can help to get around that. But yes, this is still a very new field.

Reply written 19 months ago by Gjain
Answer (1 vote), written 19 months ago by Gjain (Munich, Germany):

Hi,

I would look into the SCnorm package (https://github.com/rhondabacher/SCnorm).

Paper link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5473255/

Important point:

To evaluate the extent to which biases introduced during normalization affect the identification of DE genes, we applied MAST (FDR = 0.05) to identify DE genes between the H1-1M and H1-4M conditions. Normalization with SCnorm resulted in the identification of no DE genes, whereas MR, TPM, scran, SCDE, and BASiCS resulted in 530, 315, 684, 401, and 1147 DE genes, respectively, being identified. The majority of DE calls made using data normalized from these latter approaches are lowly expressed genes (Fig. 2 (b)), which appear to be over-normalized (Fig. 2 (a)). Supplementary Fig. S4 shows similar results using H9 cells.

[Figure 2 from the SCnorm paper]

Fold-changes and DE genes calculated from the H1 case study data. For each gene, the fold-change of non-zero counts between the H1-4M and H1-1M groups was computed for data following normalization via SCnorm, MR, TPM, scran, SCDE, and BASiCS. Box-plots of gene-specific fold-changes are shown in panel (a) for data normalized by each method. The number of genes identified as DE using MAST is shown in panel (b). Genes are divided into four equally sized expression groups based on their median among non-zero un-normalized expression measurements and results are shown as a function of expression group.

In my case, it actually worked better than SCDE- and TPM-normalized counts.
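A minimal sketch of how one might run it, based on the package vignette (the count matrix and condition vector are placeholders, and exact arguments may differ between versions):

    library(SCnorm)

    # counts: hypothetical genes x cells matrix of raw counts
    # conditions: vector assigning each cell to a group (e.g. sequencing-depth groups)
    norm_fit <- SCnorm(Data = counts,
                       Conditions = conditions,
                       PrintProgressPlots = FALSE,
                       FilterCellNum = 10)

    # Normalized expression matrix for downstream analysis
    norm_counts <- results(norm_fit)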

I hope this helps.

Answer (0 votes), written 5 days ago by ATpoint (Germany):

TPM normalization (like any naive per-million scaling) is unable to properly correct for differences in library composition. For an illustration of why naive methods perform poorly, please see this video. The same goes for naive log-scaling, and I am actually not sure why this is still commonly used in Seurat.
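A toy illustration of that composition problem (made-up numbers): two cells with identical counts for most genes, but one cell dominated by a single highly expressed gene; after per-million scaling, every other gene looks spuriously lower in that cell.

    # Made-up counts: gene1 dominates cell2's library, genes 2-5 are identical
    counts <- cbind(cell1 = c(100, 50, 50, 50, 50),
                    cell2 = c(900, 50, 50, 50, 50))
    rownames(counts) <- paste0("gene", 1:5)

    # Naive per-million scaling
    cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

    # genes 2-5 now appear ~3.7-fold "down" in cell2, purely because gene1
    # eats most of that cell's sequencing depth
    cpm["gene2", ]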

An alternative that actually produces normalized counts on the log scale is the size-factor method implemented in the scran package via the function calculateSumFactors() (publication: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7). These logcounts can then be used for feature selection and PCA, or (if necessary) for batch-correction approaches. I recommend the Bioconductor workflow for details: https://osca.bioconductor.org/
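A minimal sketch of that scran route, assuming a SingleCellExperiment called sce that holds the raw counts (following the OSCA workflow linked above):

    library(scran)
    library(scuttle)

    # sce: hypothetical SingleCellExperiment with raw counts
    clusters <- quickCluster(sce)                                      # rough pre-clustering
    sizeFactors(sce) <- calculateSumFactors(sce, clusters = clusters)  # deconvolution size factors
    sce <- logNormCounts(sce)                                          # adds a 'logcounts' assay

    # logcounts(sce) can now feed feature selection, PCA and batch correction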

Another option is SCTransform, which is now integrated into Seurat. The problem here is that SCTransform does not produce counts but returns the residuals of the model it fits. To the best of my knowledge it is still not clear whether these can be used for differential analysis; there are multiple issues in the Seurat GitHub repository, but afaik no official, bullet-proof statement from the developers. If this has changed by now, feel free to add a comment.

I personally run the SCTransform vst on each individual sample to obtain highly variable genes (HVGs), perform batch correction and clustering on the normalized counts from calculateSumFactors() using the aforementioned HVGs, and then aggregate cells per cluster into pseudobulks for the actual differential analysis with edgeR. The latter obviously requires biological replicates.
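A minimal sketch of that pseudobulk step, under the assumption that sce carries "cluster" and "sample" labels in its colData plus a sample-level "condition" column (all names here are placeholders):

    library(scuttle)
    library(edgeR)

    # Sum raw counts per cluster/sample combination into pseudobulk profiles
    pb <- aggregateAcrossCells(sce, ids = colData(sce)[, c("cluster", "sample")])

    # Differential analysis within one cluster of interest (placeholder label)
    keep <- pb$cluster == "cluster1"
    y <- DGEList(counts = counts(pb)[, keep],
                 samples = as.data.frame(colData(pb)[keep, ]))

    y <- y[filterByExpr(y, group = y$samples$condition), , keep.lib.sizes = FALSE]
    y <- calcNormFactors(y)

    design <- model.matrix(~ condition, data = y$samples)  # 'condition' is an assumed column
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y, design)
    res <- glmQLFTest(fit, coef = 2)
    topTags(res)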


TPM normalization (like any naive per-million scaling) is unable to properly correct for differences in library composition.

Yes, but it's not clear if any method can properly deal with sparse single-cell data. I keep hoping for someone to present an analysis where they demonstrate a different conclusion based on the normalization.

As Aaron Lun (of scran fame) wrote regarding VST:

In most cases, the log-transformation is probably satisfactory. ... if we put aside theoretical arguments, the widespread use of the log-transformation “in the wild” reflects its adequacy and reliability for most analysts.

Reply written 5 days ago by igor