Question

How to use GSVA for the RNA-seq RPKM file

2

Entering edit mode

7.1 years ago

lhaiyan3 ▴ 80

Dear all,

I am trying to use GSVA for the RNA-seq analysis. I have the refseq annotation based RPKM files for the RNA-seq.

RPKM <- read.table("test.txt",header=TRUE,row.names=1,sep="\t")
#C2 collection of curated gene sets that form part of the Molecular Signatures Database (MSigDB) version 3.0.
data(c2BroadSets)
gene.sets <- GeneSetCollection(c2BroadSets)
#calculate GSVA enrichment scores
enrichment.scores <- gsva(RPKM, gene.sets, method="gsva", mx.diff=TRUE, verbose=TRUE, rnaseq=FALSE, parallel.sz=8)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘gsva’ for signature ‘"data.frame", "GeneSetCollection"’

Can anyone please give me some suggestions how to use GSVA for my dataanalysis? Thanks very much.

best

HY

RNA-Seq • 14k views

ADD COMMENT • link updated 3.6 years ago by Kevin Blighe 89k • written 7.1 years ago by lhaiyan3 ▴ 80

0

Entering edit mode

Hi! I would like to ask here why you use the p.value to discriminate between significant and non-significant results? What about the adjusted p-value?

ADD REPLY • link 4.3 years ago by jaraoshenriquez • 0

0

Entering edit mode

Hi, adjusted p-values are being used. Please read the documentation for topTable() and decideTests()

ADD REPLY • link 3.8 years ago by Kevin Blighe 89k

Kevin Blighe · Answer 1 · 2017-12-29

6

Entering edit mode

7.1 years ago

Kevin Blighe 89k

Nota Bene

If you must use RPKM / FPKM data for any downstream analyses, I would transform these to Z-scale first with zFPKM package, and then use 'gaussian' kernel with gsva() (depending on your version of gsva)
you can still use RPKM / FPKM with 'poisson' kernel

See: GSVA kernels: Gaussian or Poisson?

A recent protocol that I did for a simple disease versus control comparison:

#Perform GSVA   
topMatrixGSVA <- gsva(topMatrix, c2BroadSets, min.sz=10, max.sz=999999, abs.ranking=FALSE, verbose=TRUE)
design <- model.matrix(~ factor(metadata$DiseaseStatus, levels=c("Control", "Disease")))
colnames(design) <- c("Intercept", "DiseaseVsControl")
fit <- lmFit(topMatrixGSVA, design)
fit <- eBayes(fit)
sigPathways <- topTable(fit, coef="DiseaseVsControl", number=Inf, p.value=0.05, adjust="BH")
sigPathways <- sigPathways[abs(sigPathways$logFC)>1,]
res <- decideTests(fit, p.value=0.05)

#Create an object that can easily be written to disc
wObject <- data.frame(rownames(sigPathways), sigPathways)
colnames(wObject) <- c("Pathway","Log2FoldChange","MeanExpression","tStat","Pvalue","AdjustedPval","Bvalue")

#Filter the GSVA object to only include significant pathways
topMatrixGSVA <- topMatrixGSVA[rownames(sigPathways),]

#Set colour for heatmap
require(RColorBrewer)
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-1.5, 1.5, length.out=101)
heat <- t(scale(t(topMatrixGSVA)))

heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=TRUE, keysize=1.0, key.title="", key.xlab="Enrichment Z-score", scale="none", density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=1.0, cexCol=1.0, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"), margin=c(10,25))

ADD COMMENT • link 3.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi, Kevin:

Thanks very much for your quick reply. I followed your suggestions and got the following wrong message again, can you please help me check?

gsva.scores <- gsva(data.matrix(RPKM), gene.sets, method="gsva", mx.diff=TRUE, verbose=TRUE, rnaseq=TRUE, parallel.sz=8) Error in .local(expr, gset.idx.list, ...) : No identifiers in the gene sets could be matched to the identifiers in the expression data. In addition: Warning message: In .local(expr, gset.idx.list, ...) : The argument 'rnaseq' is deprecated and will be removed in the next release of GSVA. Please use the 'kcdf' argument instead.

Thanks very much.

best

HY

ADD REPLY • link 7.1 years ago by lhaiyan3 ▴ 80

1

Entering edit mode

You have to convert gene symbols into Entrez IDs. In this code (below), topMatrix contains my expression values for which I wish to perform GSVA, with the rownames of topMatrix being the gene names that I wish to convert to Entrez IDs. I eliminate any genes that do not have a corresponding Entrez ID.

require(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annots <- getBM(mart=mart, attributes=c("hgnc_symbol", "entrezgene"), filter="hgnc_symbol", values=rownames(topMatrix), uniqueRows=TRUE)
annots <- annots[!duplicated(annots[,1]),]
topMatrix <- topMatrix[which(rownames(topMatrix) %in% annots[,1]),]
annots <- annots[which(annots[,1] %in% rownames(topMatrix)),]
topMatrix <- topMatrix[match(annots[,1], rownames(topMatrix)),]
rownames(topMatrix) <- annots[,2]

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi, Kevin:

Thanks so much for your script. I followed your biomaRt script and now I can get the gsva enrichment score. Sorry I still have another question about how to get the heatmap. Your protocol is about the disease vs control sample comparison. We just have 16 samples in the dataset, no comparison, I also don't have metadata for them. I tried to run my data according to your heatmap script, but I don't think my result is right. If you don't mind, can you please help me check what is wrong with my data? I can send my data to you by email. Sorry i don't have strong background with R, it is hard for me to follow your protocol.

topMatrixGSVA <- gsva(data.matrix(topMatrix), gene.sets, method="gsva", mx.diff=TRUE, verbose=TRUE, rnaseq=TRUE, parallel.sz=8)
Estimating GSVA scores for 1254 gene sets.
Computing observed enrichment scores
Estimating ECDFs in microarray data with Gaussian kernels
Allocating cluster
Estimating enrichment scores in parallel
Taking diff of max KS.
Cleaning up
Warning message:
In .local(expr, gset.idx.list, ...) :
  The argument 'rnaseq' is deprecated and will be removed in the next release of GSVA. Please use the 'kcdf' argument instead.

fit <- lmFit(topMatrixGSVA)
fit <- eBayes(fit)
sigPathways <- topTable(fit, adjust="BH")
head(sigPathways)

                                                       logFC    AveExpr         t   P.Value adj.P.Val         B
LIU_SOX4_TARGETS_UP                               -0.1382600 -0.1382600 -1.288334 0.2096542 0.9986813 -4.522828
BRUECKNER_TARGETS_OF_MIRLET7A3_UP                 -0.1852650 -0.1852650 -1.261956 0.2188365 0.9986813 -4.535047
WATANABE_RECTAL_CANCER_RADIOTHERAPY_RESPONSIVE_UP  0.1165940  0.1165940  1.239462 0.2269064 0.9986813 -4.545307
RHEIN_ALL_GLUCOCORTICOID_THERAPY_UP               -0.1833832 -0.1833832 -1.157445 0.2582413 0.9986813 -4.581438
MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA                 -0.1833832 -0.1833832 -1.157445 0.2582413 0.9986813 -4.581438
SHEPARD_BMYB_MORPHOLINO_DN                        -0.1833832 -0.1833832 -1.157445 0.2582413 0.9986813 -4.581438

#Create an object that can easily be written to disc
wObject <- data.frame(rownames(sigPathways), sigPathways)

colnames(wObject) <- c("Pathway","Log2FoldChange","MeanExpression","tStat","Pvalue","AdjustedPval","Bvalue")
#Filter the GSVA object to only include significant pathways
topMatrixGSVA <- topMatrixGSVA[rownames(sigPathways),]

#Set colour for heatmap
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)

myBreaks <- seq(-1.5, 1.5, length.out=101)

heat <- t(scale(t(topMatrixGSVA)))

heatmap(heat, col=myCol, breaks=myBreaks, main="Title", key=TRUE, keysize=1.0, key.title="", key.xlab="Enrichment Z-score", scale="none", density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=1.0, cexCol=1.0, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"))
Error in .External.graphics(C_layout, num.rows, num.cols, mat, as.integer(num.figures),  : 
  invalid graphics state

Thanks very much for your great help.

best

HY

ADD REPLY • link updated 7.1 years ago by Kevin Blighe 89k • written 7.1 years ago by lhaiyan3 ▴ 80

2

Entering edit mode

Hi!

If you do not have any comparisons (like case versus control), then I believe you can produce the heatmap directly from the topMatrixGSVA object. The purpose of the steps using limma is to just identify pathways that are statistically differential between 2 conditions.

Are you, nevertheless, interested in seeing different sub-groups in your data based on pathway clustering? You may therefore be interested in adding metadata colour bars to your samples, and may take a look at, for example:

Figure 6 (page 13) of GSVA: The Gene Set Variation Analysis package for microarray and RNA-seq data

The following should function for you:

topMatrixGSVA <- gsva(data.matrix(topMatrix),
  gene.sets,
  method="gsva",
  mx.diff=TRUE,
  verbose=TRUE,
  rnaseq=TRUE,
  parallel.sz=8)

#Set colour and breaks for heatmap
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-1.5, 1.5, length.out=101)

heat <- t(scale(t(topMatrixGSVA)))

heatmap(heat,
  col=myCol,
  breaks=myBreaks,
  main="Title",
  key=TRUE,
  keysize=1.0,
  key.title="",
  key.xlab="Enrichment Z-score",
  scale="none", 
  density.info="none",
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  trace="none",
  cexRow=1.0,
  cexCol=1.0,
  distfun=function(x) dist(x, method="euclidean"),
  hclustfun=function(x) hclust(x, method="ward.D2"))

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi, Kevin:

Thanks so much for your quick reply. I got the heatmap fig according to your script. I will following your suggestions to set up the different sub-group. You really help me a lot!

best

HY

ADD REPLY • link 7.1 years ago by lhaiyan3 ▴ 80

0

Entering edit mode

Okay, no problem!

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi,lhaiyan3,when you use the argument -kcdf,"Gaussian" and "Poisson" ,which one ought to be used for RNA-seq RPKM data?

ADD REPLY • link 6.7 years ago by 15324456297 • 0

0

Entering edit mode

RPKM more closely resembles Poisson. If you log it or convert to Z-scores with zFPKM package, then choose Gaussian. Note that simply logging RPKM/FPKM is not a great idea.

ADD REPLY • link 6.4 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi, Kevin: If I just use transcript counts of scRNA-seq for GSVA, should I log it or convert to Z-scores? I have thousands of cells and get several clusters. If I want to compare them to each other, can I use limma to identify pathways that are statistically different for each cluster. Thanks a lot!

ADD REPLY • link 6.3 years ago by penglinna7 • 0

1

Entering edit mode

Yes, GSVA (via Limma) allows you to compare pathways / gene signatures between groups of samples. It will likely not work with 1 Vs. 1 comparisons, though.

GSVA by default will expect a Gaussian distribution, so, your data should indeed be logged or Z-scaled. I believe you can change this default parameter, though. You can check all the parameters that are passed to GSVA to see if you can change the expected distribution.

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

0

Entering edit mode

Sorry Kevin

RNA-seq input file for GSVA should be raw read counts, right?

Thank you for this this tutorial

ADD REPLY • link 3.8 years ago by zizigolu ★ 4.3k

1

Entering edit mode

For R / Bioconductor GSVA? - I am not sure that the authors anticipated that users would use raw counts, but it's not clear from the manual entry.

If you have raw counts, I would prefer to at least normalise these.

Note the kcdf parameter that is passed to gsva():

kcdf: Character string denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples when ‘method="gsva"’. By default, ‘kcdf="Gaussian"’ which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to ‘kcdf="Poisson"’.

I wrote about it here, too: GSVA on RNAseq data (GTEX)

ADD REPLY • link 3.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you so much Kevin

If we are only intrested in a defined list of pathways, let's say some of signalling pathways, in this case what is the solution please???

I have run your tutorial using c2BroadSets, I got almost pathways out of my intrest

Thank you in advance for any help

ADD REPLY • link 3.8 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Hi, maybe they mean to filter the pathways based on name. For example:

grep('signalling', names(c2BroadSets))

ADD REPLY • link 3.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Sorry you mean creating a new c2BroadSets? As a toy I placed a pathway in your script game me a number

> grep('YANG_BREAST_CANCER_ESR1_UP', names(c2BroadSets))
[1] 794
>

ADD REPLY • link 3.8 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Yes, you can do it this way, but also try 'signaling' (one letter 'L')

c2BroadSets.signalling <- c2BroadSets[[grep('signalling|signaling', 
names(c2BroadSets))]]

ADD REPLY • link 3.8 years ago by Kevin Blighe 89k