Question

Geometric Mean of Two Genes in Expression Matrix

6.4 years ago

Ready2Rapture ▴ 20

Hello,

This is probably a very simple question.

I am currently working on an expression matrix to replicate a "cytolytic score" in R.[1] The cytolytic score is defined as "geometric mean of GZMA and PRF1". My goal is to then investigate the Pearson's correlation of this score with other genes of interest in the matrix.

I understand the concept of a geometric mean, but wanted to see if this makes mathematical sense (and is a common transformation) in expression analysis.

For example, if we have 3 samples (S1, S2, S3) for the two genes:

       S1       S2    S3
PRF1   1.1      2    0.5
GZMA   2        1    1

The transformation would be:

               S1                S2           S3
CytScore  sqrt(1.1*2)     sqrt(2*1)      sqrt(0.5*1)

Thus, after this transformation, you still have a set of 3 values that can be used in correlation calculations.
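A minimal sketch of the calculation described above, written in Python for illustration (the same steps carry over to R). It uses the toy values from the example; `gene_x` is a hypothetical third gene invented purely to show the correlation step, and the helper names are assumptions rather than anything from the cited paper.

```python
import math

# Toy values from the example above, assumed to be on a linear (non-log) scale.
samples = ["S1", "S2", "S3"]
prf1 = {"S1": 1.1, "S2": 2.0, "S3": 0.5}
gzma = {"S1": 2.0, "S2": 1.0, "S3": 1.0}

# Hypothetical gene of interest, made up for illustration only.
gene_x = {"S1": 3.0, "S2": 2.5, "S3": 1.0}


def geometric_mean_pair(a, b):
    """Geometric mean of two positive values."""
    return math.sqrt(a * b)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# One score per sample: sqrt(PRF1 * GZMA).
cyt_score = {s: geometric_mean_pair(prf1[s], gzma[s]) for s in samples}

# Correlate the per-sample score with the hypothetical gene.
r = pearson([cyt_score[s] for s in samples], [gene_x[s] for s in samples])
print(cyt_score)
print(r)
```

With only three samples the correlation itself is not meaningful; the point is only that the score is an ordinary per-sample vector.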

[1] Molecular and genetic properties of tumors associated with local immune cytolytic activity. doi: 10.1016/j.cell.2014.12.033

R microarray RNA RNA-Seq • 4.5k views

updated 6.4 years ago by Ahill ★ 1.9k • written 6.4 years ago by Ready2Rapture ▴ 20

score 1 · Answer 1 · 2017-11-20

It's not unreasonable to use a geometric mean in the way you describe, and it has certainly been used like this in gene expression analysis. A few things to keep in mind:

The geometric mean is probably appropriate if your input expression values are on a linear scale (not log). The log of the geometric mean equals the arithmetic mean of the logged values, so if your data are already log-transformed, a plain arithmetic mean of the two genes gives the equivalent score on the log scale.

A rationale for using the geometric mean is that if PRF1 and GZMA have very different magnitudes, the geometric mean will be less dominated by the larger one. How appropriate it is may depend on how your expression data were collected: do you expect PRF1 and GZMA to have similar magnitudes in your dataset? If GZMA is much more abundant than PRF1 but you believe both are equally important to the mechanism underlying your score, the geometric mean may be a more representative summary statistic than the arithmetic mean.

There are other ways of averaging gene profiles that might also be used (unless you specifically want to replicate the published score). One example: z-scoring the logged expression values and averaging the z-scores. The choice is normally determined by how your input data are distributed and how you expect expression levels among genes to be correlated or related.
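As a rough illustration of two points above (the link between the geometric mean and logged values, and the z-score alternative), here is a minimal Python sketch using the toy values from the question. The variable names and the `zscore` helper are assumptions for illustration, and the values are assumed to be strictly positive and on a linear scale; zeros would need separate handling before taking logs.

```python
import math
from statistics import mean, stdev

# Toy values from the question (samples S1, S2, S3), linear scale, all > 0.
prf1 = [1.1, 2.0, 0.5]
gzma = [2.0, 1.0, 1.0]

# Geometric mean per sample.
geo = [math.sqrt(p * g) for p, g in zip(prf1, gzma)]

# Arithmetic mean of the logged values, then back-transformed with exp().
log_mean = [(math.log(p) + math.log(g)) / 2 for p, g in zip(prf1, gzma)]
back_transformed = [math.exp(v) for v in log_mean]

# The two routes agree up to floating-point error.
agree = all(math.isclose(a, b) for a, b in zip(geo, back_transformed))
print(agree)


def zscore(values):
    """Centre and scale using the sample standard deviation (n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Alternative score: z-score each gene's log2 expression across samples,
# then average the two z-scores within each sample.
z_prf1 = zscore([math.log2(v) for v in prf1])
z_gzma = zscore([math.log2(v) for v in gzma])
z_score_avg = [(a + b) / 2 for a, b in zip(z_prf1, z_gzma)]
print(z_score_avg)
```

The z-score version puts each gene on the same scale before averaging, so neither gene dominates regardless of its absolute abundance; the geometric mean achieves something similar only for the location, not the spread, of each gene.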