Hi, In Seurat I would like to understand the algorithm behind

FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

My understanding : This function compute a score for each gene to select the 2000 bests for the next step, the PCA. For a gene, the more variability in the counts matrix for each cells the better. The Seurat Algorithm is described in the following paper : https://www.biorxiv.org/content/biorxiv/early/2018/11/02/460147.full.pdf page 12 : Feature Selection for better dataset

What I understand : 1) For each gene they compute the mean and variance of not normalized UMI and apply log10 for var and mean 2) If we plot var in function of mean we should have something similar to a line so they perform a polynomial regression (order 2) 3) they recompute an other variance using mean and the last regression

if you follow the tutorial https://satijalab.org/seurat/archive/v3.2/pbmc3k_tutorial.html just before FindVariableFeatures you can display first line of count matrix :

data_counts <- GetAssayData(object = pbmc, slot = "counts")

print(head(data.frame(data_counts ), 1))

which gives u a line with AL627309.1 and nine counts with 1 in each count

just after you can display the results of FindVariableFeatures:

print(head(HVFInfo(object = pbmc),23))

which gives u

mean variance variance.standardized

AL627309.1 0.0034116755 0.0034013253 0.9330441

AP006222.2 0.0011372252 0.0011363627 0.9924937

RP11-206L10.2 0.0018953753 0.0018925002 0.9627290

How can have a mean and a variance of 0.0034 with an array of [1,1,1,1,1,1,1,1,1] ? mean([1,1,1,1,1,1,1,1,1]) = 1 log10(1) = 0

i have seen a part of the source code here : https://rdrr.io/cran/Seurat/src/R/preprocessing.R but it doesn't help me

it is a technical question but if someone have an idea of how mean and var is computed it will be a pleasure to understand.

Kind regards

Matt