Question: z score transformation by population or by gene?
14 months ago by
Pietro100
Italy
Pietro100 wrote:

In calculating z-scores for microarray or RNA-Seq data, I have found two main answers on how to obtain them.

For example, in R, having a log2 expression matrix x with genes in rows and samples in columns, I would do:

zscore <- function(x) {
  # z-score using the mean and SD of all values in x (global scaling)
  z <- (x - mean(x)) / sd(x)
  return(z)
}

But many people suggest using the base R function scale() on the transposed matrix, like this:

mat_zscore <- t(scale(t(x)))

If I am not wrong, the two approaches are different. In the first, I subtract the mean and divide by the SD of all values in the matrix (the global, or "population", mean and SD). The second uses scale(), which operates by column by default, so the matrix is transposed to calculate the mean and SD for each gene (row).

My question is, is one of the two more correct than the other? And why are both given as valid alternatives?

Thanks

ADD COMMENTlink modified 14 months ago by Kevin Blighe63k • written 14 months ago by Pietro100
14 months ago by
Kevin Blighe63k
Kevin Blighe63k wrote:

They should give the same values. Here is my proof, taking the scaling functions from pheatmap() and heatmap.2() and comparing them to scale(): see the thread "cannot replicate the pheatmap scale function".

Keep in mind that we also scale either by row or by column. Your function scales by the global mean and global standard deviation. In a typical transcriptomics setting, with genes in rows, scale(t(x)) scales each row of x, i.e., each gene; wrapping it in t() restores the original orientation.
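To make the difference between global and per-gene scaling concrete, here is a minimal sketch on a toy matrix. The matrix values and the names z_global and z_gene are invented for illustration and do not come from the thread.

```r
# Toy log2 expression matrix: 3 genes (rows) x 4 samples (columns).
# The values are invented purely for illustration.
x <- matrix(c( 1,  2,  3,  4,
              10, 10, 12, 12,
               5,  7,  5,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4)))

# Global z-score: one mean and one SD computed over every value in the matrix.
z_global <- (x - mean(x)) / sd(x)

# Per-gene z-score: each row is centred and scaled by its own mean and SD.
z_gene <- t(scale(t(x)))

# Compare the two results; for this matrix they differ, so this is FALSE.
isTRUE(all.equal(z_global, z_gene, check.attributes = FALSE))

# Per-gene scaling gives every gene a row mean of 0, whereas global scaling
# keeps the differences in overall expression level between genes.
round(rowMeans(z_gene), 10)
round(rowMeans(z_global), 2)
```

With global scaling, a highly expressed gene stays high in every sample; with per-gene scaling, only the variation of each gene across samples remains.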

Kevin

ADD COMMENTlink modified 14 months ago • written 14 months ago by Kevin Blighe63k

My question was more like: "Is it better to scale by global or by gene mean and SD?"

ADD REPLYlink written 14 months ago by Pietro100

Can you show an example where global mean and global sdev were used?

ADD REPLYlink written 14 months ago by Kevin Blighe63k

This thread: "How transform FPKM values to Z-score using R"

Or do you mean an article?

ADD REPLYlink written 14 months ago by Pietro100

Both answers in that thread are old, and the answers by Seán and dariober are different, as you have also highlighted in your question.

The scale() function always scales by column only, so each column of its input is scaled separately. You can get it to scale by row with t(scale(t(x))). Scaling each row or column separately may be more favourable in certain situations, e.g., for visualisation. However, I have never seen a comprehensive review of why one approach would be preferable to the other. You may receive a better answer by posting on Cross Validated.
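As a rough illustration of the column-versus-row behaviour described above, the sketch below compares scale(x), t(scale(t(x))), and a manual per-row calculation. The random toy data, the seed, and the names by_column, by_row, and by_row_manual are assumptions made for this example only.

```r
set.seed(1)  # arbitrary seed so the toy data are reproducible
x <- matrix(rnorm(20, mean = 8, sd = 2), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

by_column <- scale(x)        # default: centre and scale each column (sample)
by_row    <- t(scale(t(x)))  # transpose trick: centre and scale each row (gene)

# Manual per-row version. The length-5 vectors are recycled down the columns,
# so element [i, j] is centred and scaled with the statistics of row i.
by_row_manual <- (x - rowMeans(x)) / apply(x, 1, sd)

# The transpose trick and the manual version should agree.
all.equal(by_row, by_row_manual, check.attributes = FALSE)

# Sanity checks: every row of by_row has SD 1, and every column of
# by_column has mean 0 (up to floating-point error).
apply(by_row, 1, sd)
round(colMeans(by_column), 10)
```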

ADD REPLYlink modified 14 months ago • written 14 months ago by Kevin Blighe63k