Geometric Mean of Two Genes in Expression Matrix
6.6 years ago


This is probably a very simple question.

I am currently working on an expression matrix to replicate a "cytolytic score" in R.[1] The cytolytic score is defined as "geometric mean of GZMA and PRF1". My goal is to then investigate the Pearson's correlation of this score with other genes of interest in the matrix.

I understand the concept of a geometric mean, but wanted to see if this makes mathematical sense (and is a common transformation) in expression analysis.

For example, if we have 3 samples (S1, S2, S3) for the two genes:

       S1       S2    S3
PRF1   1.1      2    0.5
GZMA   2        1    1

The transformation would be:

               S1                S2           S3
CytScore  sqrt(1.1*2)     sqrt(2*1)      sqrt(0.5*1)

Thus, after this transformation, you still have a set of 3 values that can be used in correlation calculations.
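To make the arithmetic above concrete, here is a minimal sketch in plain Python (standard library only) that computes the per-sample geometric mean for the example table and then a Pearson correlation against another gene. The `gene_x` values are made up purely for illustration, and `cytolytic_score` and `pearson` are hypothetical helper names, not functions from the paper or from any package.

```python
import math

# Example values from the table above (one value per sample).
prf1 = {"S1": 1.1, "S2": 2.0, "S3": 0.5}
gzma = {"S1": 2.0, "S2": 1.0, "S3": 1.0}


def cytolytic_score(gene_a, gene_b):
    """Per-sample geometric mean of two genes: sqrt(a * b)."""
    return {s: math.sqrt(gene_a[s] * gene_b[s]) for s in gene_a}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


cyt_score = cytolytic_score(prf1, gzma)
# S1 ≈ 1.483, S2 ≈ 1.414, S3 ≈ 0.707

# Hypothetical gene of interest (invented values, illustration only).
gene_x = {"S1": 3.0, "S2": 2.5, "S3": 1.0}

samples = sorted(cyt_score)
r = pearson([cyt_score[s] for s in samples], [gene_x[s] for s in samples])
print(cyt_score, r)
```

With only three samples the correlation is not meaningful by itself; the point is just that the score is one value per sample and can be treated like any other row of the matrix.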

  1. Molecular and genetic properties of tumors associated with local immune cytolytic activity. doi: 10.1016/j.cell.2014.12.033
R microarray RNA RNA-Seq • 4.6k views
6.6 years ago
Ahill ★ 2.0k

It's not unreasonable to use a geometric mean in the way you describe, and it has certainly been used like this in gene expression analysis. A few things to keep in mind, though.

The geometric mean is probably appropriate if your input expression values are on a linear scale (not log). Taking the geometric mean is equivalent to taking the arithmetic mean of the logged values and then transforming back to the linear scale. A rationale for using it is that if PRF1 and GZMA have very different magnitudes, the geometric mean will be less dominated by the larger one. How appropriate it is may depend on how your expression data were collected: do you expect PRF1 and GZMA to have similar magnitudes in your dataset? If GZMA is much more abundant than PRF1 but you believe both are equally important to the mechanism underlying your score, the geometric mean may be a more representative summary statistic than the arithmetic mean.

There are other ways of averaging gene profiles that might also be used (unless you specifically want to replicate the published score). One example is z-scoring the logged expression values and averaging the z-scores. The choice is normally determined by how your input data are distributed and how you expect expression levels among genes to be correlated or related.
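As a rough illustration of the two points about the geometric mean (its relation to the mean of logs, and its lower sensitivity to the more abundant gene), here is a small Python sketch. The expression values 100 and 1 are invented, and `geometric_mean` and `arithmetic_mean` are illustrative helper names.

```python
import math


def geometric_mean(values):
    """exp of the arithmetic mean of the natural logs (values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Invented linear-scale values: GZMA far more abundant than PRF1.
gzma, prf1 = 100.0, 1.0

base_geo = geometric_mean([gzma, prf1])   # same as sqrt(100 * 1), about 10
base_ari = arithmetic_mean([gzma, prf1])  # 50.5

# Fold change of each summary when one gene doubles and the other is fixed.
geo_fold_prf1 = geometric_mean([gzma, 2 * prf1]) / base_geo   # about 1.414
geo_fold_gzma = geometric_mean([2 * gzma, prf1]) / base_geo   # about 1.414
ari_fold_prf1 = arithmetic_mean([gzma, 2 * prf1]) / base_ari  # about 1.010
ari_fold_gzma = arithmetic_mean([2 * gzma, prf1]) / base_ari  # about 1.990

print(geo_fold_prf1, geo_fold_gzma, ari_fold_prf1, ari_fold_gzma)
```

Doubling either gene changes the geometric mean by the same factor (the square root of 2), whereas the arithmetic mean barely notices a doubling of the low-abundance gene. Note that the log-based form requires strictly positive values, so zeros (common in TPM data) would need separate handling, such as a pseudocount.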


Thank you Akhil, this makes sense. In the paper, they use Transcripts per Million (TPM) counts from RNA-seq to calculate the score.

The data I'm looking at is a normalized Affy microarray experiment, but I can get the relative signal intensity as well. I am skeptical, however, of using this measurement for a microarray due to the statistical noise. I do like your z-scoring approach which I'm guessing is:

  1. Calculate the z-score of each value within its gene's row (across samples).
  2. For each sample, average the resulting z-scores across the two genes.
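The two steps above could be sketched as follows in plain Python, reusing the toy values from the question. This is only an illustration under some assumptions: the inputs are treated as linear-scale and are log2-transformed first (as suggested in the answer), the sample standard deviation (n - 1 denominator) is used, and `zscores` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def zscores(values):
    """Z-score a gene's values across samples (sample SD, n - 1 denominator)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Toy linear-scale values for samples S1, S2, S3 (from the question).
prf1 = [1.1, 2.0, 0.5]
gzma = [2.0, 1.0, 1.0]

# Assumption: inputs are on a linear scale, so log2-transform first.
# Skip this step if the data are already logged (e.g. RMA-normalized arrays).
prf1_z = zscores([math.log2(v) for v in prf1])
gzma_z = zscores([math.log2(v) for v in gzma])

# One score per sample: the average of the two genes' z-scores.
z_score_avg = [(p + g) / 2 for p, g in zip(prf1_z, gzma_z)]
print(z_score_avg)
```

Because each gene is centered and scaled separately, both genes contribute on the same footing regardless of their absolute abundance, and the resulting scores are centered on zero across samples.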

Thank you for your help.

