Question: GSVA on RNAseq data (GTEX)
gravatar for kakukeshi
8 months ago by
kakukeshi50 wrote:

I'm using the bioconductor package GSVA ( to perform a GSVA analysis using RNA-seq data from GTEX and I'm confused about which dataset I need to use and if I need to perform some pre-processing steps.

According to the GSVA package vignette the input should be RNA-seq counts, does this means that I need to use the file "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz" instead of "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz"? and should I do something before use it as input?


rna-seq R • 506 views
ADD COMMENTlink modified 8 months ago by Kevin Blighe51k • written 8 months ago by kakukeshi50
gravatar for Kevin Blighe
8 months ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Hey kakukeshi,

This is a good question.

The parameter for gsva() about which you need to be aware is kcdf. kcdf can have the values "Gaussian" or "Poisson". If your input data is normalised RNA-seq counts, then choose "Poisson", i.e., Poisson distribution. If your data is regularised log, variance-stabilised, or other logged data, including microarray normalised expression values, then choose "Gaussian", i.e. Gaussin distribution.

You can check the distribution of your data with the hist() command.

If you are planning to use the RPKM counts, then choose "Poisson". First ensure that these are not logged by generating the histogram.

I previously used GSVA for GTEx data. I took the raw counts (_gene_reads.gct.gz) and normalised these in DESeq2. I then transformed them via regularised log transformation and used kcdf="Gaussian".


ADD COMMENTlink written 8 months ago by Kevin Blighe51k

Hey Kevin, This is very helpful. Many thanks.

Following your suggestions what I did was to filter genes with low count first from the raw counts file (.gene_reads.gct.gz) using:

i.rm <- which(rowSums(subsetdat)<5000)
subsetdat <- subsetdat[-i.rm,]

Then I transformed the data to a DESeq object with:

subsetDES <- DESeqDataSetFromMatrix(countData = subsetdat,
                                  colData = ctab,
                                  design = ~ 1)

And finally rlog transform:

rld <- rlog(subsetDES)

Is this correct or there's something that I'm missing before use the data as input for gsva?

ADD REPLYlink modified 8 months ago • written 8 months ago by kakukeshi50

Hey, I believe you are missing one step just before the rlog() function, such as:

subsetDES <- DESeq(subsetDES)

Otherwise, you are performing rlog on the raw counts.

Also, your filter with rowSums() seems very high. You can try different filters, though. I usually go by mean <= 10 or 20.

ADD REPLYlink written 8 months ago by Kevin Blighe51k

Ups! Thanks Kevin. May I ask you if I should apply some type of filtering for the samples (individuals)?

ADD REPLYlink written 8 months ago by kakukeshi50

Hey, I have no major preference for filtering on samples. Usually a PCA bi-plot can ehlp to determine if a sample should be excluded or not. You could also look at similar metrics, like mean and max expression (per sample). Most people filter on the genes.

ADD REPLYlink written 8 months ago by Kevin Blighe51k

Just one last thing Kevin. I noticed that after the rlog step the samples are not exactly normally distributed (link below), so I scaled the values using gtex <- apply(gtex, 2, scale) but after doing that the results from the gsva are different. My question is: is it really meaningful to scale the values? or should I keep as they are after rlog

ADD REPLYlink modified 8 months ago • written 8 months ago by kakukeshi50

Hey, rlog never usually returns data that perfectly follows a normal distribution. In the past, I have often converted rlog data to Z-scale, like you have done. I think that this approach is fine.

apply(gtex, 2, scale) should produce the same output as t(scale(t(gtex))), I think, i.e., scaling by gene.

Edit: with apply(gtex, 2, scale), you are actually scaling each column. You may want to just try the t(scale(t(x))) approach.

Note that the scale function with default settings is essentially Z-transforming your data. It is very useful.

ADD REPLYlink modified 8 months ago • written 8 months ago by Kevin Blighe51k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1755 users visited in the last hour