Question: Interpretation of "standardized expression matrix"?
mike-zx200 wrote:

I've seen this term used in some papers working with gene expression data. I assume it refers to performing z-score normalization on the expression matrix, but I would like to know if that interpretation is right. Also, is this typically done over each gene vector (the rows of a traditional expression matrix) or over the samples (columns)? Another question: is it always done one way, or does it depend on the downstream analysis we want to perform? For example, I've encountered the term in co-expression papers, which sometimes also call this "zero-centering the expression matrix". And what about PCA? I think that in R the function `prcomp` by default performs the normalization on the columns, but could you, in some situations, do it over the rows before PCA?

rna-seq normalization • 95 views
modified 13 days ago by Kevin Blighe60k • written 13 days ago by mike-zx200
Kevin Blighe60k wrote:

Generally, yes, it can be regarded as meaning Z-scaled. The Z distribution is also referred to as the 'Standard Normal' distribution, and it proves quite a useful transformation to make in various parts of biological data analysis because numbers from this distribution are readily interpreted: [source: https://mathbitsnotebook.com/Algebra2/Statistics/STstandardNormalDistribution.html]

So, if we process some single-cell RNA-seq data and eventually transform it to the Z scale, a gene with Z > 3 in a particular group of cells is 3 standard deviations above the mean expression of that gene across all cells, which corresponds to a two-tailed p of roughly 0.003. A 5% alpha (p = 0.05) on a two-tailed test is equivalent to an absolute Z of 1.96.
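As a quick check in base R, `pnorm()` recovers these two-tailed p-values from the standard normal CDF:

```r
# Two-tailed p-value for a given |Z|: twice the lower-tail probability
2 * pnorm(-1.96)  # ~0.05
2 * pnorm(-3)     # ~0.0027
```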

`prcomp()`, by default, centers the data column-wise to have mean 0 - it does this by simply subtracting the mean of each column from all values in that column. However, `prcomp()`, by default, does not scale the data by dividing by the standard deviation, which is what would ultimately bring it to a Z distribution; this can be activated by simply selecting:

```
prcomp(x, center = TRUE, scale. = TRUE)
```

`prcomp()` just uses the `scale()` function 'under the hood', so you could look up that function. It's used a lot for heatmaps - take a look at my proof here: A: cannot replicate the pheatmap scale function
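As a quick sanity check of that point (toy matrix, base R only), `prcomp()` with `center`/`scale.` gives the same scores as running it on data pre-processed with `scale()` yourself:

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)  # 5 observations x 4 variables

# Centering and scaling inside prcomp() ...
p1 <- prcomp(x, center = TRUE, scale. = TRUE)

# ... matches applying scale() first and disabling prcomp()'s own pre-processing
p2 <- prcomp(scale(x), center = FALSE, scale. = FALSE)

all.equal(p1$x, p2$x)  # TRUE
```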

Kevin

Thank you once again for your answers, Kevin. What still confuses me is that PCA does the transform column-wise, while your scRNA-seq example would do it row-wise, to transform each gene across all cells. Is transforming column-wise the same as transforming row-wise? Intuitively I would say no, but from your answer I understood that it is the same.

It would not be the same to scale row-wise as column-wise. However, note that when we use `prcomp()`, we virtually always supply the transposed input data, so that it is ultimately the genes that are scaled.
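To make the difference concrete, here is a toy illustration (a made-up 4-gene × 3-sample matrix; `scale()` works column-wise, so scaling the genes means scaling the transpose):

```r
set.seed(42)
mat <- matrix(rnorm(12, mean = 10), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

by_sample <- scale(mat)        # each sample (column) gets mean 0, sd 1
by_gene   <- t(scale(t(mat)))  # each gene (row) gets mean 0, sd 1

max(abs(by_sample - by_gene))  # non-zero: the two directions genuinely differ

# For PCA where the genes end up as the scaled variables, transpose first:
pca <- prcomp(t(mat), center = TRUE, scale. = TRUE)
```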

But if, for some analysis, you wanted to do PCA with the samples as the features, would it be OK to do the z-score transformation row-wise (genes) and then again over the columns (samples) right before PCA? For example, some correction techniques tested for co-expression analysis do PCA like this and then regress gene expression on the sample loadings as the independent terms in the regression; co-expression is then calculated afterwards. I've seen this in papers, but it is not explained in detail whether the genes are standardized and then PCA is performed with additional scaling over the columns, or whether it is performed without scaling.
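For what it's worth, here is a hypothetical sketch of that kind of correction on simulated data (one plausible reading, not any specific paper's pipeline): standardize the genes, run PCA with the samples as features, regress each gene on the first k sample loadings, and compute co-expression on the residuals:

```r
set.seed(7)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))

# 1) Standardize each gene (row-wise z-scores)
z <- t(scale(t(expr)))

# 2) PCA with samples as the features (genes are the observations)
pca <- prcomp(z, center = TRUE, scale. = FALSE)

# 3) Regress each gene on the first k sample loadings; keep the residuals
k <- 2
resid_expr <- t(apply(z, 1, function(g) residuals(lm(g ~ pca$rotation[, 1:k]))))

# 4) Gene-gene co-expression on the corrected matrix
coexpr <- cor(t(resid_expr))
```

Whether step 2 should additionally scale the columns is exactly the ambiguity you describe; the sketch above leaves them unscaled.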

There is no right or wrong here, and, technically, one does not have to standardise anything prior to performing PCA. Methods sections are almost always lacking in published works, too.