I've followed the Seurat vignette tutorial for pre-processing my scRNAseq data.
When I look at the scaled data (using the ScaleData() function), I get values between -14.18 and 10, with an average of -0.005.
The Seurat vignette says that ScaleData():
Shifts the expression of each gene, so that the mean expression across cells is 0
Scales the expression of each gene, so that the variance across cells is 1
This is what I did:
library(Seurat)
# Initialize the Seurat object with the non-normalized data
h5 <- CreateSeuratObject(counts = h5.data)
dataset <- h5
#QC
dataset[["percent.mt"]] <- PercentageFeatureSet(dataset, pattern = "^MT-") # mitochondrial percentage
#Filtering
dataset <- subset(dataset, subset = nFeature_RNA > 250 & nFeature_RNA < 10000 & percent.mt < 15)
#Normalization
dataset <- NormalizeData(dataset, normalization.method = "LogNormalize", scale.factor = 10000)
#Most variable gene identification
dataset <- FindVariableFeatures(dataset, selection.method = "vst", nfeatures = 2000, verbose = FALSE, dispersion.cutoff = c(-Inf, 0.5), mean.cutoff = c(0.0125, 3))
#Scaling
all.genes <- rownames(dataset)
dataset <- ScaleData(dataset)
What did I do wrong?
Thank you for your answer!
What confused me is the following sentence from the Seurat package: "Scales the expression of each gene, so that the variance across cells is 1". It made me believe that if the mean is 0 and the variance is 1, all the values should be between 0 and 1.
Think about it: if the mean is zero and the variance is not zero, then some values must lie below the mean and some above it, so there must be negative values. That follows simply from how mean and variance work mathematically.
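To make that concrete, here is a minimal sketch in base R with a made-up expression vector (the numbers are purely illustrative, not taken from your data). Centering a non-negative vector on its mean necessarily produces values on both sides of zero:

```r
# Hypothetical log-normalized expression of one gene across five cells
expr <- c(0, 0, 0.5, 1.2, 2.3)

# Centering: subtract the gene's mean so the new mean is 0
centered <- expr - mean(expr)

centered        # contains both negative and positive values
sum(centered)   # zero, up to floating-point error
```

Since the centered values sum to zero and are not all identical, some of them have to be negative.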
You're right! But then the values should be between -1 and 1, right? And not between -14 and 10.
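No, a variance of 1 does not bound the values to [-1, 1]. You can simulate this:
hist(rnorm(1000, 0, 1))
Even for a standard normal distribution, roughly a third of the values fall outside [-1, 1]. Some outliers will probably produce these extreme values, but if the sample size is large enough they have only a modest influence on the total variance.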
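Single-cell data is usually far more skewed than a normal distribution, which is where the really large values come from. Below is a rough sketch in base R with a hypothetical gene that is zero in almost all cells. It is not Seurat's actual implementation, just a per-gene z-score. It also illustrates one likely reason why your maximum is exactly 10 and your overall mean is slightly below 0: ScaleData() has a scale.max argument (default 10) that, as far as I know, caps the scaled values from above after centering and scaling.

```r
# Hypothetical gene: zero in most cells, strongly expressed in only five
n_cells <- 5000
expr <- c(rep(0, n_cells - 5), rep(4, 5))

# Per-gene z-score: subtract the mean, divide by the standard deviation
z <- (expr - mean(expr)) / sd(expr)

mean(z)    # 0, up to floating-point error
var(z)     # 1
range(z)   # the five expressing cells end up far above 10

# Mimic the upper cap applied by scale.max = 10 (assumption: upper cap only)
z_clipped <- pmin(z, 10)

max(z_clipped)    # exactly 10
mean(z_clipped)   # slightly negative, because only the top was cut off
```

So a mean of about -0.005 and a maximum of exactly 10 are consistent with the default settings, not with a mistake in your pipeline. If you want to check this on your own object, you could rerun ScaleData() with a larger scale.max and compare the range.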