GSVA R packages
1
1
Entering edit mode
14 months ago
Morris_Chair ▴ 230

Hello everyone, I'm trying to do a gene set varian analysis using R to detect a specific gene set signature of a specific pathway from 20 samples of RNA-seq. I have this files in BAM format but I don't know what to do in order to get a panel like this one in the pic! which R packages are required?

Screen-Shot-2020-04-09-at-9-55-49-AM

on the Y axis there is the score for a specific singling pathway while the x axis represents the different samples.

Thank you

GSVA • 1.3k views
ADD COMMENT
0
Entering edit mode

Hi Kevin,

I created the rlog matrix.txt and run GSVA, but I m doing something wrong, To run GSVA I m following this vignette:

https://bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf

library(GSEABase)
library(GSVAdata)
data(c2BroadSets)
c2BroadSets

canonicalC2BroadSets <-c2BroadSets[c(grep("^KEGG",names(c2BroadSets)),
                                     grep("^REACTOME", names(c2BroadSets)),
                                     grep("^BIOCARTA", names(c2BroadSets)))]

canonicalC2BroadSets
data(genderGenesEntrez)

MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),collectionType=BroadCollection(category="c2"), setName="MSY")

XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),collectionType=BroadCollection(category="c2"), setName="XiE")

canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))

mydata <- read.table(file= "mytable.csv",sep="",header=TRUE, row.names = 1)

esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
                 kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)

but at this point I have an error : Error in gsva(mydata, canonicalC2BroadSets, min.sz = 5, max.sz = 500, : could not find function "gsva"

I'm not familiar with this script, can you tell me how to use it?

Do I need of others libraries?

thank you very much

ADD REPLY
0
Entering edit mode

Hey Morris, I think that you also need to load the main package, like this:

require(GSVA)
ADD REPLY
2
Entering edit mode
14 months ago

Hello,

Your best approach would be this:

  1. produce a raw counts matrix for your BAM files (HT-seq, featureCounts, or something else)
  2. normalise the raw counts (EdgeR, DESeq2, etc)
  3. transform the normalised counts via, e.g., regularised log, variance stabilisation, or log2 (CPM + 1)
  4. Run GSVA with the 'C2' Broad Institute curated datasets

The output of GSVA will be what you can then use to generate bar plot for each enriched signature / pathway.

I will leave the specifics of each step to you.

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin, I have this error

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘gsva’ for signature ‘"data.frame", "GeneSetCollection"’

probably it's not right the format of the matrix that I'm using or I'm missing a step?

Thank you

ADD REPLY
0
Entering edit mode

Hey, perhaps it has to be a data matrix?

ADD REPLY
0
Entering edit mode

Hi Kevin, I converted the file from data.frame to a "as.matrix" ... but another error occurs

esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500, + kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)

Error in .local(expr, gset.idx.list, ...) : 
  Less than two genes in the input expression data matrix
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  55046 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

thank you for your help

ADD REPLY
0
Entering edit mode

Oh, here is the worrying part:

  55046 genes with constant expression values throughout the samples.

What if you do this (before any conversion to data matrix):

table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0)
table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0)

That will tabulate samples and then genes of 0 variance, i.e., constant values

Also, make sure that you retry gsva() with data.matrix(), not as.matrix()

Did you pre-filter your data?

ADD REPLY
0
Entering edit mode

I filtered the matrix by removing empty rows and rows with values assigned with duplicated gene names because they were giving me error as well.

Here is the output seems that the error persists

> mydata <- read.table(file= "mytable_genename.csv",sep="",header=TRUE, row.names = 1)
> table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0))
< table of extent 0 >
> table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0))
< table of extent 0 >
> mydata<-data.matrix(mydata)


> esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
+                  kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
Error in .local(expr, gset.idx.list, ...) : 
  Less than two genes in the input expression data matrix
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  55046 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

Thank you

ADD REPLY
0
Entering edit mode

Hey Morris, all good? Why is your separater an empty field - sep="" ? What is the delimiter used in your file, mytable_genename.csv ?

ADD REPLY
0
Entering edit mode

Hi Kevin, I m ok for now thanks :)

I thought that maybe separator could be a problem so I removed it, now because you told me I converted the file from .csv to .txt without specifying the separator and I have a different result:

> mydata <- read.table(file= "mytable_genename.txt",sep="",header=TRUE, row.names = 1)
> table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0))

FALSE 
   18 
> table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0))

FALSE  TRUE 
50640  4406 
> mydata<-data.matrix(mydata)

> esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
+                  kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
Error in .local(expr, gset.idx.list, ...) : 
  No identifiers in the gene sets could be matched to the identifiers in the expression data.
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  4406 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

Maybe I should be using the ensemble for the gene name..

ADD REPLY
0
Entering edit mode

Cool, oh, but, can you show the actual contents of the file? Using sep="", irrespective of the file extension, will cause problems.

How are the fields / columns separated inside the file? You can open it with Notepad or Gedit (Ubuntu), or some other text editor.

ADD REPLY
0
Entering edit mode

I loaded the file into dropbox, if it's ok for you, you can download it from here

https://www.dropbox.com/s/yqa8yeialsxrlme/mytable_genename.txt?dl=0

thank you

ADD REPLY
0
Entering edit mode

My time is quite limited - sorry. You literally just need to look in the file to see how the columns are separated. It is most likely a comma or tab-space. So, sep = ',' or sep = '\t' should work.

ADD REPLY
0
Entering edit mode

it was separated by tab-space but the result didn't changed after editing the code :/

Screen-Shot-2020-04-23-at-6-04-21-PM

> mydata <- read.table(file= "mytable_genename.txt",sep="\t",header=TRUE, row.names = 1)
> table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0))

FALSE 
   18 
> table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0))

FALSE  TRUE 
50640  4406 

> mydata<-data.matrix(mydata)

> esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
+                  kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
Error in .local(expr, gset.idx.list, ...) : 
  No identifiers in the gene sets could be matched to the identifiers in the expression data.
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  4406 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

thank you

ADD REPLY
0
Entering edit mode

Yes, the problem is that you have 4406 genes with 0 variance, so, there is nothing that can be done with these. You will have to remove them prior to performing GSVA. With the apply() function, you can basically create a vector of boolean values for TRUE and FALSE.

For example:

keep <- apply(mydata, 1, function(x) var(x, na.rm = TRUE) > 0

mydata<-data.matrix(mydata[keep,])

Now I am going out for a jog, while social distancing. Hope that it helps!

ADD REPLY
1
Entering edit mode

I will work on this, thank you, lucky you that you can jog, enjoy :)

ADD REPLY
0
Entering edit mode

Hello Kevin Blighe,

I have one more Question, In the GSVA help page it is suggested to use tokcdf="Poisson" for a suitable modelisation of RNA-seq (count Data)

By default,kcdf="Gaussian"which is suitable wheninput expression values are continuous,
such as microarray fluorescent units inlogarithmic scale,  RNA-seq  log-CPMs,  log-RPKMs or log-TPMs.
When in-put expression values are integer counts, such as those derived from RNA-seqexperiments, 
then this argument should be set tokcdf="Poisson"

but in your comment you suggest a 2 step normalisation of raw counts (DESeq2) and log transformation in the DESeq2 package we can do

counts(dds, normalized=TRUE) # normalised count

vsd <- vst(dds, blind=FALSE) # variance stabilizing transformation

rld <- rlog(dds, blind=FALSE) # regularized log' transformation and ntd <- normTransform(dds) # Normalized counts transformation

Can you explain which one to use ? Should I use a diagnostic plot ?

ADD REPLY
1
Entering edit mode

Hello, if you wish to use "Poisson", then you should use counts(dds, normalized=TRUE)

ADD REPLY

Login before adding your answer.

Traffic: 2240 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6