Evaluate normalization method for scRNA-Seq Dataset
7 weeks ago
spriyansh29 ▴ 30

I am working with scRNA-Seq data from human first-trimester placenta and decidua. The dataset was generated with Drop-Seq and is available from the Single Cell Expression Atlas.

The Single Cell Expression Atlas provides the dataset in a raw format (raw count values before filtering and normalization) and in a normalized format (untransformed expression values, normalized to counts per million). When I plot violins of the number of counts for the two, I get the following plots, which is expected (Figure 1).
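For reference, counts-per-million scaling as described for the normalized download can be sketched as follows. This is a minimal illustration on a made-up toy matrix, not the Atlas pipeline itself; the function name `counts_per_million` and the numbers are assumptions for the example.

```python
def counts_per_million(counts):
    """Scale each cell's raw counts so that they sum to one million."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Guard against empty cells to avoid division by zero.
        normalized.append([c * 1e6 / total if total else 0.0 for c in cell])
    return normalized


# Toy matrix (made-up numbers): rows are cells, columns are genes.
raw = [
    [10, 0, 5, 85],
    [200, 50, 0, 250],
    [1, 1, 1, 1],
]

cpm = counts_per_million(raw)

# Per-cell totals differ widely before scaling and are identical after it.
raw_totals = [sum(cell) for cell in raw]
cpm_totals = [sum(cell) for cell in cpm]
print(raw_totals)
print([round(t) for t in cpm_totals])
```

Because every cell is rescaled to the same total, a violin plot of per-cell total counts spreads out for the raw data and collapses to a single value for CPM data.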

However, I saw different results when I applied Seurat's normalization to the raw data. I am not sure what kind of normalization was applied to the datasets from Sc Array Express, but it seems reasonably accurate, as the data look normally distributed (Plot 2).

I would appreciate it if someone could comment on which of these methods is best suited to this dataset. As far as I can tell, all of the normalizations are doing their job correctly, but I am not sure which one to choose. Is there a way to evaluate which normalization is best?

normalization scRNA-Seq transcriptomics seurat • 305 views
7 weeks ago
LChart 840

If you click through to the publication, the methods section has:

The gene expression matrix for each sample was generated, and ubiquitously expressed ribosomal protein–coding (RPS and RPL) and MALAT1 noncoding RNA genes were removed. Seurat objects were created for individual samples. Only those genes that were expressed in more than three cells and cells that expressed more than 100 genes were retained. All 10x Seurat objects for individual samples were merged into one 10x combined object (MergeSeurat), followed by scaling data (ScaleData) and finding variable genes (FindVariableGenes). All Drop-seq Seurat objects for individual samples were processed through similar steps as described above to generate a single Drop-seq combined object. Next, the union of the top 2000 variable genes for each, 10x and Drop-seq, combined objects was used to perform canonical correlation analysis (CCA) between 10x and Drop-seq datasets. Then, CCA subspaces were aligned using 1:16 CCA dimensions, which was followed by integrated t-SNE visualization for all cells.

So some genes are removed prior to scaling, which will slightly affect the normalization. There is also a Seurat version difference (v2 in the original analysis, v4 today), and some of the methodology may have changed slightly.
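To illustrate why removing genes before normalization matters, here is a rough sketch on a single made-up cell. It assumes a library-size normalization with a scale factor of 10,000 (the default `scale.factor` of Seurat's `NormalizeData`); the gene names `GENE_A` to `GENE_C`, the counts, and the helper functions are hypothetical.

```python
SCALE = 10_000  # assumed scale factor, matching Seurat's NormalizeData default


def is_removed(gene):
    # Mirrors the quoted methods: RPS/RPL ribosomal genes and MALAT1 are dropped.
    return gene.startswith(("RPS", "RPL")) or gene == "MALAT1"


def relative_counts(cell, scale=SCALE):
    # Divide each gene by the cell's total count, then multiply by the scale factor.
    total = sum(cell.values())
    return {gene: count * scale / total for gene, count in cell.items()}


# One toy cell with made-up counts.
cell = {
    "RPS6": 300, "RPL13": 200, "MALAT1": 100,
    "GENE_A": 30, "GENE_B": 20, "GENE_C": 350,
}

before = relative_counts(cell)
filtered = {g: c for g, c in cell.items() if not is_removed(g)}
after = relative_counts(filtered)

# The raw count of GENE_A is unchanged, but its normalized value shifts
# because the cell's total count shrinks once the abundant genes are removed.
print(before["GENE_A"], after["GENE_A"])
```

In this toy case the removed genes make up most of the cell's counts, so the effect is exaggerated; in real data the shift depends on how much of each cell's library those genes account for.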


Hi LChart, thanks for your response. I believe I was not clear in explaining the problem. I did go through the publication, as you've already quoted. However, I do not see that the authors performed normalization (Seurat's normalization steps).

I came across a few articles discussing the implications of log-normalization for count data (ref this and this). As you can see from the plots I've posted, relative count (RC) normalization makes the counts more normally distributed than log-normalization does (the latter is what I see used in almost all Seurat workflows). Therefore, I would like to know on what basis I should decide on a normalization strategy. I find Seurat's RC more intuitive than Seurat's LogNormalize for this particular dataset; what are your thoughts?
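For anyone comparing the two options, the difference can be sketched as below. As I understand Seurat's `NormalizeData`, "RC" divides each count by the cell's total and multiplies by `scale.factor`, and "LogNormalize" then applies a natural log1p to that value. The functions here only mimic that arithmetic in plain Python on a made-up cell; they are not Seurat's code.

```python
import math


def rc_like(cell, scale_factor=10_000):
    # Relative counts: count / cell total * scale factor, no log transform.
    total = sum(cell)
    return [c * scale_factor / total for c in cell]


def lognormalize_like(cell, scale_factor=10_000):
    # Same scaling, followed by log1p (natural log of 1 + x).
    return [math.log1p(v) for v in rc_like(cell, scale_factor)]


# One toy cell with made-up counts spanning several orders of magnitude.
cell = [0, 1, 9, 90, 900]

rc = rc_like(cell)
ln = lognormalize_like(cell)

for raw_count, r, l in zip(cell, rc, ln):
    print(raw_count, round(r, 1), round(l, 3))
```

The log step is monotonic, so the ranking of genes within a cell is the same under both methods and zeros stay at zero; what changes is the dynamic range, which is compressed under the log transform. That difference in scale is worth keeping in mind when comparing the shapes of the two violin plots.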

