I'm using the Bioconductor package GSVA (http://bioconductor.org/packages/GSVA) to perform a GSVA analysis on RNA-seq data from GTEx, and I'm confused about which dataset I need to use and whether I need to perform any pre-processing steps.
According to the GSVA package vignette, the input should be RNA-seq counts. Does this mean that I need to use the file "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz" instead of "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz"? And should I do anything to it before using it as input?
The gsva() parameter that you need to be aware of is kcdf, which can take the values "Gaussian" or "Poisson". If your input data are normalised RNA-seq counts, then choose "Poisson", i.e., a Poisson distribution. If your data are regularised log, variance-stabilised, or otherwise logged data, including normalised microarray expression values, then choose "Gaussian", i.e., a Gaussian distribution.
You can check the distribution of your data with the hist() command.
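If you are planning to use the RPKM values, bear in mind that they are continuous rather than integer counts. The GSVA documentation recommends kcdf="Poisson" for integer counts and kcdf="Gaussian" for continuous values such as log-RPKMs, so log-transform them, e.g. log2(RPKM + 1), and choose "Gaussian". First check whether they are already logged by generating the histogram.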
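As a rough illustration of the histogram check, here is a minimal base-R sketch. The helper describe_scale() is a hypothetical name of my own, and the data are simulated rather than taken from GTEx; it only reports a few properties that usually separate unlogged counts from log-scale values.

```r
# Hypothetical helper (not part of GSVA): summarise a few properties that
# usually distinguish unlogged counts from log-scale expression values.
describe_scale <- function(expr) {
  vals <- as.numeric(as.matrix(expr))
  vals <- vals[is.finite(vals)]
  list(
    all_integer  = all(vals == round(vals)),  # TRUE for raw counts
    has_negative = any(vals < 0),             # possible after some log transforms
    max_value    = max(vals)                  # very large values suggest unlogged data
  )
}

# Simulated data for illustration only (these are not GTEx values).
set.seed(1)
counts <- matrix(rnbinom(2000, mu = 200, size = 0.5), nrow = 200)
logged <- log2(counts + 1)

# Unlogged counts are strongly right-skewed with a long tail;
# log-scale values are compressed into a much narrower range.
hist(as.vector(counts), breaks = 50, main = "Counts", xlab = "value")
hist(as.vector(logged), breaks = 50, main = "log2(counts + 1)", xlab = "value")

str(describe_scale(counts))
str(describe_scale(logged))
```

If the histogram has a long right tail and the maximum is in the thousands or more, the values are almost certainly not logged.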
I previously used GSVA for GTEx data. I took the raw counts (_gene_reads.gct.gz) and normalised these in DESeq2. I then transformed them via regularised log transformation and used kcdf="Gaussian".
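The workflow I describe can be sketched roughly as follows. This is an outline under assumptions, not the exact code I ran: read_gct() and gsva_from_counts() are hypothetical helper names, the design formula ~ 1 is a placeholder, and the gsva() interface depends on your GSVA version (newer releases take a parameter object built with gsvaParam(), older ones take the matrix and gene sets directly), so check ?gsva for the version you have installed.

```r
# Hypothetical helper: read a GCT file into a numeric matrix.
# GCT layout: line 1 is the version, line 2 the dimensions, then a
# tab-separated table whose first two columns are Name and Description.
# Gzipped files can be passed directly; R decompresses them on reading.
read_gct <- function(path) {
  tab <- read.delim(path, skip = 2, check.names = FALSE,
                    stringsAsFactors = FALSE)
  mat <- as.matrix(tab[, -(1:2), drop = FALSE])
  rownames(mat) <- tab$Name
  mat
}

# Hypothetical helper: raw counts -> rlog -> GSVA with a Gaussian kernel.
# gene_sets is assumed to be a named list of character vectors using the
# same gene identifiers as rownames(counts).
gsva_from_counts <- function(counts, gene_sets) {
  storage.mode(counts) <- "integer"
  coldata <- data.frame(sample = colnames(counts),
                        row.names = colnames(counts))
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ 1)  # placeholder design
  # rlog() is slow with many samples; DESeq2::vst() is the usual alternative.
  rld <- DESeq2::rlog(dds, blind = TRUE)
  expr <- SummarizedExperiment::assay(rld)

  if ("gsvaParam" %in% getNamespaceExports("GSVA")) {
    # Newer GSVA interface: build a parameter object first.
    GSVA::gsva(GSVA::gsvaParam(exprData = expr, geneSets = gene_sets,
                               kcdf = "Gaussian"))
  } else {
    # Older GSVA interface: matrix and gene sets passed directly.
    GSVA::gsva(expr, gene_sets, method = "gsva", kcdf = "Gaussian")
  }
}

# Usage (needs DESeq2 and GSVA installed, and your own gene_sets list):
# counts <- read_gct("GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz")
# scores <- gsva_from_counts(counts, gene_sets)
```

With the full GTEx sample set you will probably want to subset to the tissues of interest before the transformation, because rlog scales poorly with the number of samples.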