I have started using Seurat to analyze data from a scRNA-seq experiment, and I would like to calculate cell cycle scores for my dataset using the CellCycleScoring function (let's leave regressing out unwanted/uninteresting sources of variation out of the discussion for now). According to the “vanilla” PBMC 3K Guided Tutorial and the Seurat Cell-Cycle Scoring vignette, the count data must be normalized with NormalizeData before invoking CellCycleScoring. Without SCTransform, the order of operations is therefore CreateSeuratObject - NormalizeData - CellCycleScoring - FindVariableFeatures - ScaleData, or alternatively CreateSeuratObject - NormalizeData - FindVariableFeatures - ScaleData - CellCycleScoring.
On the other hand, the vignette for SCTransform states that SCTransform replaces NormalizeData, ScaleData, and FindVariableFeatures. Does this imply that CellCycleScoring can be invoked directly after SCTransform (CreateSeuratObject - SCTransform - CellCycleScoring), or do the data have to be normalized with NormalizeData and scored before invoking SCTransform (CreateSeuratObject - NormalizeData - CellCycleScoring - SCTransform), as suggested here? I am asking because the number of cells assigned to each phase differs depending on the exact procedure used, as shown below for four different scenarios.
    # Assumes Filtered_seurat_object, s_genes and g2m_genes are already defined
    library(Seurat)
    library(dplyr)    # provides %>%
    library(ggplot2)
    library(cowplot)  # provides plot_grid()

    #1. Non-normalized data
    obj_0 <- Filtered_seurat_object
    obj_0 <- CellCycleScoring(obj_0, g2m.features = g2m_genes, s.features = s_genes)
    obj_0 <- FindVariableFeatures(obj_0, selection.method = "vst", nfeatures = 2000)
    obj_0 <- ScaleData(obj_0)
    obj_0 <- RunPCA(obj_0)
    a <- obj_0@meta.data %>%
      ggplot(aes(Phase)) + geom_bar() +
      ggtitle("1.Non-normalized data") +
      theme(plot.title = element_text(size = 8))

    #2. NormalizeData & SCTransform
    obj_1 <- NormalizeData(Filtered_seurat_object)
    obj_1 <- CellCycleScoring(obj_1, g2m.features = g2m_genes, s.features = s_genes)
    obj_1 <- SCTransform(obj_1, vst.flavor = "v2")
    obj_1 <- RunPCA(obj_1)
    b <- obj_1@meta.data %>%
      ggplot(aes(Phase)) + geom_bar() +
      ggtitle("2.NormalizeData & SCTransform") +
      theme(plot.title = element_text(size = 8))

    #3. SCTransform only
    obj_2 <- SCTransform(Filtered_seurat_object, vst.flavor = "v2")
    obj_2 <- CellCycleScoring(obj_2, g2m.features = g2m_genes, s.features = s_genes)
    obj_2 <- RunPCA(obj_2)
    c <- obj_2@meta.data %>%
      ggplot(aes(Phase)) + geom_bar() +
      ggtitle("3.SCTransform only") +
      theme(plot.title = element_text(size = 8))

    #4. NormalizeData & ScaleData (No SCTransform)
    obj_3 <- NormalizeData(Filtered_seurat_object)
    obj_3 <- CellCycleScoring(obj_3, g2m.features = g2m_genes, s.features = s_genes)
    obj_3 <- FindVariableFeatures(obj_3, selection.method = "vst", nfeatures = 2000)
    obj_3 <- ScaleData(obj_3)
    obj_3 <- RunPCA(obj_3)
    d <- obj_3@meta.data %>%
      ggplot(aes(Phase)) + geom_bar() +
      ggtitle("4.NormalizeData & ScaleData\n(No SCTransform)") +
      theme(plot.title = element_text(size = 8))

    # Show the four bar plots side by side
    plot_grid(a, b, c, d, align = "h", ncol = 4, labels = "AUTO", label_size = 8)
Here are the plots showing the number of cells in each phase for each of the four methods. I cannot help noticing that the distribution of cells across phases in the non-normalized dataset (1) resembles the pattern obtained after invoking SCTransform without NormalizeData (3), whereas the number of cells in each phase is identical for methods (2) and (4), in which NormalizeData was called before CellCycleScoring.
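Bar plots only compare totals per phase, so two methods can give the same counts while still labelling individual cells differently. A minimal sketch of a per-cell comparison follows. It uses toy vectors with made-up labels (not results from my data), and the helper name compare_phases is hypothetical, not a Seurat function.

```r
# Toy stand-ins for two per-cell phase vectors, named by cell barcode.
# The values are invented for illustration only.
phase_a <- c(cell1 = "G1", cell2 = "S", cell3 = "G2M", cell4 = "G1", cell5 = "S")
phase_b <- c(cell1 = "G1", cell2 = "S", cell3 = "G1",  cell4 = "G1", cell5 = "S")

# Hypothetical helper: cross-tabulate two phase assignments cell by cell
# and report the fraction of cells that receive the same label.
compare_phases <- function(x, y) {
  # Both vectors must describe the same cells in the same order
  stopifnot(identical(names(x), names(y)))
  # Convert to character so factors with different level sets compare safely
  x <- as.character(x)
  y <- as.character(y)
  lv <- c("G1", "S", "G2M")
  tab <- table(factor(x, levels = lv), factor(y, levels = lv),
               dnn = c("method_a", "method_b"))
  list(table = tab, agreement = mean(x == y))
}

res <- compare_phases(phase_a, phase_b)
print(res$table)
print(res$agreement)
```

With the objects from my code, the inputs would presumably be obj_1$Phase and obj_3$Phase (or any other pair), assuming both objects contain the same cells in the same order.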
Given that "SCTransform replaces NormalizeData, ScaleData, and FindVariableFeatures", I do not know which procedure, or order of operations, most accurately describes the state of the cells in the dataset. Why are the distributions of cells across phases similar for non-normalized and SCTransformed data? Does this mean that non-normalized data can be used for CellCycleScoring? What am I doing wrong or missing here? Thanks to anyone for reading this post and helping!