Question: scRNA-seq, SEURAT, NormalizeData, ScaleData, PCA, CCA ...
Bogdan (Palo Alto, CA, USA) wrote 20 months ago:

Dear all,

As I have just started reading the documentation for Seurat for scRNA-seq (among a few other packages), I would appreciate your answers and insights on the following:

  1. After the NormalizeData() function, why is the ScaleData() function needed?

  2. Do FindVariableGenes(), RunPCA(), and FindClusters() work on Normalized_Data or on Scaled_Data?

  3. Does ScaleData() work on RAW_DATA or on NORMALIZED_DATA?

  4. Does RunCCA() work on Normalized_Data or on Scaled_Data?

An example of R code is at: https://satijalab.org/seurat/immune_alignment.html

Thanks a lot!

-- bogdan

modified 20 months ago • written 20 months ago by Bogdan
igor (United States) wrote 20 months ago:

The alignment tutorial focuses on the alignment steps. You should consult the more basic PBMCs tutorial that explains the other steps in more detail.

After the NormalizeData() function, why is the ScaleData() function needed?

ScaleData() scales and centers genes in the dataset, which standardizes the range of expression values for each gene. The function can additionally regress out unwanted sources of variation, such as technical noise, when variables are supplied in vars.to.regress.
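To make the per-gene centering and scaling concrete, here is a minimal sketch in plain Python. It is not Seurat's implementation; the gene names and values are made up, and using the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def scale_gene(values):
    """Center one gene's values to mean 0 and scale them to SD 1."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    if sd == 0:
        # a gene with no variation carries no signal; return zeros
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# hypothetical normalized expression: gene -> values across 4 cells
normalized = {
    "GeneA": [0.0, 1.0, 2.0, 3.0],
    "GeneB": [5.0, 5.5, 6.0, 6.5],
}

# each gene is scaled on its own, independently of the other genes
scaled = {gene: scale_gene(vals) for gene, vals in normalized.items()}
```

Both hypothetical genes end up with the same scaled values even though their original ranges differ, which is the point of the step.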

Do FindVariableGenes(), RunPCA(), and FindClusters() work on Normalized_Data or on Scaled_Data?

Scaled data is used for dimensionality reduction and clustering.

Is ScaleData() absolutely needed in scRNA-seq analysis?

It is recommended. Technically, you could skip that step and set the scale.data slot to anything if you would like to see the results without it. However, the results would be dominated by the signal of a few highly expressed genes.
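As a rough illustration of that last point, the sketch below (plain Python, made-up values, not Seurat code) compares each gene's share of the total variance before and after per-gene scaling. The names HighGene, LowGene1 and LowGene2 are hypothetical.

```python
from statistics import mean, stdev, variance

def zscore(values):
    """Center to mean 0 and scale to SD 1 (sample SD)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def variance_share(genes):
    """Fraction of the total variance contributed by each gene."""
    var = {g: variance(v) for g, v in genes.items()}
    total = sum(var.values())
    return {g: var[g] / total for g in genes}

# hypothetical values: one highly expressed gene, two lowly expressed genes
unscaled = {
    "HighGene": [100.0, 140.0, 90.0, 150.0],
    "LowGene1": [0.1, 0.9, 0.2, 0.8],
    "LowGene2": [1.0, 0.2, 0.9, 0.1],
}
scaled = {g: zscore(v) for g, v in unscaled.items()}

share_before = variance_share(unscaled)  # HighGene holds almost all of the variance
share_after = variance_share(scaled)     # every gene contributes equally

print(share_before)
print(share_after)
```

A variance-based method such as PCA would mostly follow HighGene on the unscaled values, while after scaling each gene has the same weight.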

Does RunCCA() work on Normalized_Data or on Scaled_Data?

Scaled data is used for dimensionality reduction.

modified 15 months ago • written 20 months ago by igor

"Data is scaled to regress out "uninteresting" sources of variation such as technical noise."

I believe this is not quite right: data is scaled so that each feature (a gene, in this context) contributes similarly to the downstream steps. Regressing out unwanted signal, which by the way should be used with caution (1), is optional and is not the primary objective of data scaling.

1) A blog post on regression on scRNA-seq datasets

modified 20 months ago • written 20 months ago by Haci

Yes, your interpretation is correct. The main purpose of scaling is to make data comparable across genes. Regression is a secondary (and optional) effect of scaling. From ?ScaleData:

Scales and centers genes in the dataset. If variables are provided in vars.to.regress, they are individually regressed against each gene, and the resulting residuals are then scaled and centered.

Indeed, in past versions, regression was a separate function from ScaleData.
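The behaviour described in that help text can be sketched as follows, assuming a simple linear fit with a single covariate. This is plain Python rather than Seurat code, and the covariate n_umi and all values are hypothetical.

```python
from statistics import mean, stdev

def regress_out(expr, covariate):
    """Residuals of a least-squares fit expr ~ intercept + slope * covariate."""
    mx, my = mean(covariate), mean(expr)
    sxx = sum((x - mx) ** 2 for x in covariate)
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, expr))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(covariate, expr)]

def scale_and_center(values):
    """Center to mean 0 and scale to SD 1 (sample SD, an assumption)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

# hypothetical per-cell covariate and one gene's normalized expression
n_umi = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
gene = [1.2, 1.9, 3.2, 3.8, 5.1]

residuals = regress_out(gene, n_umi)   # step 1: remove the covariate's linear effect
scaled = scale_and_center(residuals)   # step 2: scale and center the residuals
```

In this sketch the same two steps would be repeated for every gene, each with its own fit.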

modified 20 months ago • written 20 months ago by Santosh Anand

I guess I (and the Seurat tutorial) did not explicitly mention the primary objective. Yes, the scaling adjusts the range of expression values across all the genes, which will likely impact the downstream analysis far more than any additional regression. When I originally wrote the answer, I was thinking specifically in the context of the Seurat workflow in addition to the default scale function.

written 20 months ago by igor

Hi Igor, thank you for your reply. If I may add a question:

Does ScaleData() work on RAW_DATA or on NORMALIZED_DATA?

Thank you!

written 20 months ago by Bogdan

You go from raw to normalized to scaled.
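As a sketch of that order, assuming log-normalization with a scale factor of 10,000 (which I believe matches the NormalizeData() defaults) followed by per-gene scaling. This is plain Python with made-up counts, not Seurat code, and cells are stored as rows here purely for readability.

```python
import math
from statistics import mean, stdev

SCALE_FACTOR = 10_000  # assumed to match the NormalizeData() default

def log_normalize(counts_per_cell):
    """Per cell: divide by the cell total, multiply by SCALE_FACTOR, take log1p."""
    normalized = []
    for cell in counts_per_cell:
        total = sum(cell)
        normalized.append([math.log1p(c / total * SCALE_FACTOR) for c in cell])
    return normalized

def scale_genes(normalized):
    """Per gene (column): center to mean 0 and scale to SD 1 across cells."""
    columns = []
    for j in range(len(normalized[0])):
        col = [cell[j] for cell in normalized]
        mu, sd = mean(col), stdev(col)
        columns.append([(v - mu) / sd for v in col])
    # transpose back to cells x genes
    return [list(row) for row in zip(*columns)]

# hypothetical raw counts: 3 cells (rows) x 3 genes (columns)
raw = [
    [10, 0, 90],
    [30, 5, 65],
    [5, 20, 175],
]
normalized = log_normalize(raw)   # raw -> normalized (works within each cell)
scaled = scale_genes(normalized)  # normalized -> scaled (works within each gene)
```

Normalization works within each cell to correct for sequencing depth, while scaling works within each gene across cells, so the two steps are not interchangeable.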

written 20 months ago by igor

ScaleData() scales and centers genes in the dataset, which standardizes the range of expression values across all the genes. The function additionally regresses out unwanted sources of variation such as technical noise.

I think it would be more accurate to say "which standardizes the range of expression values for each gene." I think ScaleData() adjusts the expression values gene by gene. For each gene, it builds a regression model using that gene's expression level across all cells, then shifts the residuals to a mean of zero and divides them by their standard deviation. The "across all the genes" is not accurate.

Am I right?

written 15 months ago by chansigit

I edited the statement to make it clearer. I originally meant that all genes (as opposed to all cells) are scaled, but I can see how it could be interpreted differently.

written 15 months ago by igor