I am working with scRNA-seq data from human first-trimester placenta and decidua. The dataset is available from the Single Cell Expression Atlas and was generated with Drop-seq.
The Single Cell Expression Atlas provides the dataset in raw format (raw count values before filtering and normalization) and in normalized format (untransformed expression values, normalized to counts per million). When I plot violins of the number of counts for the two, I get the following plots, which are as expected.
However, I saw different results when I applied Seurat's normalization to the raw data. I am not sure what kind of normalization was applied to the datasets from the Single Cell Expression Atlas, but it seems reasonable, as the data look roughly normally distributed (Plot 2).
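To make the difference concrete, here is a minimal sketch in Python (not Seurat itself) of the two transformations being compared. It assumes the atlas values are plain counts per million as described above, and that the Seurat run used the default LogNormalize method with a scale factor of 10,000. The toy matrix and the function names `cpm` and `log_normalize` are made up for illustration.

```python
import numpy as np

# Toy raw count matrix: rows are genes, columns are cells (made-up numbers).
raw = np.array([
    [10, 0, 3],
    [5, 2, 0],
    [85, 18, 27],
], dtype=float)

def cpm(counts):
    """Scale each cell (column) so that its counts sum to one million."""
    totals = counts.sum(axis=0, keepdims=True)
    return counts / totals * 1e6

def log_normalize(counts, scale_factor=1e4):
    """Seurat-style LogNormalize: log1p(count / cell total * scale_factor)."""
    totals = counts.sum(axis=0, keepdims=True)
    return np.log1p(counts / totals * scale_factor)

cpm_values = cpm(raw)
lognorm_values = log_normalize(raw)

# Per-cell sums: identical for CPM, different after the log transform.
print(cpm_values.sum(axis=0))
print(lognorm_values.sum(axis=0))
```

Under these assumptions every cell has the same total after CPM scaling, while the log-transformed values sum to something different in each cell, so violin plots of per-cell totals from the two methods are not expected to match.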
I would appreciate it if someone could comment on which of these methods is best suited to this dataset. As far as I can tell, all of the normalizations are doing their job correctly, but I am not sure which one to choose. Is there a way to evaluate which normalization is best?
If you click through to the publication, the methods section has:
The gene expression matrix for each sample was generated, and ubiquitously expressed ribosomal protein–coding (RPS and RPL) and MALAT1 noncoding RNA genes were removed. Seurat objects were created for individual samples. Only those genes that were expressed in more than three cells and cells that expressed more than 100 genes were retained. All 10x Seurat objects for individual samples were merged into one 10x combined object (MergeSeurat), followed by scaling data (ScaleData) and finding variable genes (FindVariableGenes). All Drop-seq Seurat objects for individual samples were processed through similar steps as described above to generate a single Drop-seq combined object. Next, the union of the top 2000 variable genes for each, 10x and Drop-seq, combined objects was used to perform canonical correlation analysis (CCA) between 10x and Drop-seq datasets. Then, CCA subspaces were aligned using 1:16 CCA dimensions, which was followed by integrated t-SNE visualization for all cells.
So some genes were removed prior to scaling, which will slightly affect the normalization. There is also a Seurat version difference (V2 in the original analysis, V4 today), and some of the methodology may have changed slightly.
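As a rough illustration of why the gene removal matters, the Python sketch below drops ribosomal protein and MALAT1 genes before computing counts per million. The gene list, the counts, and the name pattern (`^RP[SL]` plus `MALAT1`) are assumptions for the example, not the authors' actual filter, and the toy numbers exaggerate the effect.

```python
import re
import numpy as np

# Made-up counts: rows are genes, columns are two cells.
genes = ["RPS3", "RPL5", "MALAT1", "KRT7", "HLA-G"]
counts = np.array([
    [30, 10],
    [20, 15],
    [25, 5],
    [15, 40],
    [10, 30],
], dtype=float)

def cpm(matrix):
    """Scale each cell (column) so that its counts sum to one million."""
    return matrix / matrix.sum(axis=0, keepdims=True) * 1e6

# Hypothetical filter: names starting with RPS/RPL, or exactly MALAT1.
pattern = re.compile(r"^(RP[SL]|MALAT1$)")
keep = np.array([pattern.match(g) is None for g in genes])
kept_genes = [g for g, k in zip(genes, keep) if k]

cpm_all = cpm(counts)[keep]    # normalize first, then look at the kept genes
cpm_filtered = cpm(counts[keep])  # remove genes first, then normalize

for i, gene in enumerate(kept_genes):
    print(gene, cpm_all[i], cpm_filtered[i])
```

Because the per-cell total shrinks once those genes are gone, the remaining genes get larger normalized values, and the size of the shift depends on how much of each cell's signal the removed genes carried.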