Question

GSVA kernels: Gaussian or Poisson?

2


3.8 years ago

psm ▴ 130

Hi all, this question has indirectly come up several times. What is the best kcdf setting to use for GSVA analysis on non-log or non-variance normalized TPM data?

For GSVA analysis using RNAseq data, the GSVA manual states:

"We calculate now GSVA enrichment scores for these gene sets using first the microarray data and then the RNA-seq integer count data. Note that the only requirement to do the latter is to set the argument kcdf="Poisson" which is "Gaussian" by default. Note, however, that if our RNA-seq derived expression levels would be continuous, such as log-CPMs, log-RPKMs or log-TPMs, the default value of the kcdf argument should remain unchanged."

I assume that non-variance normalized TPM data should be treated using the "Poisson" argument. However, following length normalization, most TPM values end up as non-integers. I realize that this is the result of a linear transformation, so the underlying structure of the data is unchanged, but the manual seems to imply that the Gaussian setting may be appropriate for non-integer data, which includes non-variance normalized TPM.

I clearly don't understand the nuances of this setting, so I'm wondering what other people's thoughts, suggestions, or explanations are on this topic. For now, I'm just applying log1p to my TPM data and using the Gaussian argument, which runs much faster.
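To make the two properties discussed above concrete (whether the values are integers, and what log1p does to them), here is a minimal Python sketch. It is only an illustration: GSVA itself is an R package, the helper names `is_integer_valued` and `log1p_transform` are hypothetical, and the TPM values are made up.

```python
import math

def is_integer_valued(values, tol=1e-9):
    """Return True if every value is within tol of a whole number."""
    return all(abs(v - round(v)) <= tol for v in values)

def log1p_transform(values):
    """Apply log(1 + x) (natural log) to each value; zeros map to zero."""
    return [math.log1p(v) for v in values]

# Made-up TPM values for a handful of genes in one sample (assumption).
tpm = [0.0, 0.5, 3.2, 12.7, 150.4, 2048.9]

# Length-normalized values such as TPM are generally not integers.
print("integer-valued:", is_integer_valued(tpm))

# log1p compresses the large values and keeps zeros at zero.
log_tpm = log1p_transform(tpm)
print("log1p range:", min(log_tpm), "to", max(log_tpm))
```

The integer check only says whether the values look like counts; it does not by itself decide which kcdf setting is appropriate.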

RNA-Seq • 2.3k views

updated 3.8 years ago by Kevin Blighe 87k • written 3.8 years ago by psm ▴ 130

score 1 · Answer 1 · 2020-07-04

1


3.8 years ago

Kevin Blighe 87k

It really does just depend on the distribution of the input data, which you already appear to understand. Gaussian is the default because most datasets passed to GSVA will already have been normalised and transformed to an approximately Gaussian distribution. If I were using FPKM, RPKM, or just 'normalised' RNA-seq counts, on the other hand, I would use Poisson.

So, check via a histogram and other summary metrics to see which distribution your data actually follow.
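As a rough illustration of that kind of check, the sketch below computes a few summary metrics and a coarse histogram before and after log1p. It is a minimal Python example using only the standard library; the function names `summarize` and `text_histogram` are hypothetical, the values are made up, and no threshold here is an official rule for choosing kcdf.

```python
import math
import statistics

def summarize(values):
    """Return mean, population variance, and skewness of the values."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    sd = math.sqrt(var)
    if sd == 0:
        skew = 0.0
    else:
        skew = sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)
    return {"mean": mean, "variance": var, "skewness": skew}

def text_histogram(values, bins=5):
    """Count values in equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    return counts

# Made-up expression values for one gene across samples (assumption).
raw = [0, 0, 1, 1, 2, 50]
logged = [math.log1p(v) for v in raw]

for name, data in (("raw", raw), ("log1p", logged)):
    print(name, summarize(data), text_histogram(data))
```

A long right tail (large positive skewness, most values in the first bin) suggests the data are still on a count-like scale; a more symmetric histogram after transformation is closer to what the Gaussian default assumes.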

Kevin


1


Thank you Kevin - appreciate the response.

3.8 years ago by psm ▴ 130