Question: GVSA: problem with gene identifier
0
gravatar for cmgraef
3.0 years ago by
cmgraef10
cmgraef10 wrote:

Hi,

I'm currently trying to calculate a signature score for immune cells in the tumor environment using RNAseq data from the TCGA database. To do so, I'm using the TCGAbiolinks and GVSA package:

query.gbm <- GDCquery(project = "TCGA-GBM",
                 data.category = "Gene expression",
                 data.type = "Gene expression quantification",
                 platform = "Illumina HiSeq", 
                 file.type  = "normalized_results",
                 experimental.strategy = "RNA-Seq",
                 legacy = TRUE)

data <- GDCprepare(query.gbm)
gene_expression <- assay(data)

mic <- c("Epb41l5","Ysk4","Fam72a","Cdk18","Adora1","Olfml2b","Slamf1","Sccpdh","1110021L09Rik","Ank3","Kitl","Pmel","Sec14l2","Snap47","Zfp867","Fam83g","Rab34","Sarm1","Leprel4","Hsd17b1","Plekhh3","Etv4","Rundc3a","Ccdc46","Apob","Wdr35", ...)
genelist_mic <- list(mic)

score <- gsva(gene_expression, genelist_mic,
               method= "ssgsea",
               rnaseq=TRUE,
               abs.ranking=FALSE,
               min.sz=1,
               max.sz=Inf,
               no.bootstraps=0,
               bootstrap.percent = .632,
               parallel.sz=0,
               parallel.type="SOCK",
               mx.diff=TRUE,
               tau=switch(method, gsva=1, ssgsea=0.25, NA),
               kernel=TRUE,
               ssgsea.norm=TRUE,
               verbose=TRUE)

Everything seems to work just fine. I get an expression matrix with gene ids in rows and samples in the corresponding columns but I always get the following error message for the gsva() function:

Error in .local(expr, gset.idx.list, ...) : No identifiers in the gene sets could be matched to the identifiers in the expression data.

If I adjust my gene list (genelist_mic) to some of the gene identifiers from my gene expression matrix (gene_expression) that I get from TCGA, the code works. I also checked the gene identifiers of both the gene expression matrix and my cell type specific gene list with the David tool and it looks like they are all Gene Official Symbols. So I don't understand why it doesn't recognize the gene identifiers.

The gene identifiers from the gene expression matrix that I get back from TCGA look like this:

A1BG A2M NAT1 NAT2 RP11-986E7.7 SERPINA3 AADAC AAMP AANAT AARS ABAT ABCA1 ABCA2 ABCA3 ABCB7 ABCF1

Does anyone know why this might happen?

I would highly appreciate your help!

Thanks, Moritz

gsea rna-seq gvsa • 2.3k views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 3.0 years ago by cmgraef10
2
gravatar for Mike
3.0 years ago by
Mike1.5k
UK
Mike1.5k wrote:

I also observed same in my cases, so name of gene in gene list (genelist_mic) and your expression matrix should be same. I mean gene "Ysk4" in genelist_mic should be same as "Ysk4" in expression matrix. And one more thing that some TCGA data analysis package automatically convert official gene name, eg, MAP3K19 is for Ysk4. so please check your data matrix that this gene is same as your input (Ysk4) or it is converted into MAP3K19.

ADD COMMENTlink written 3.0 years ago by Mike1.5k
0
gravatar for cmgraef
3.0 years ago by
cmgraef10
cmgraef10 wrote:

Hi Mike,

thanks for your fast reply! So I downloaded the data, checked again and noticed that the gene identifiers in the gene expression matrix were capital letters as opposed to my genelist. So I converted my genelist to all capital letters but now I get the following error message, which I don't understand:

'Error in FUN(X[[i]], ...) : object 'method' not found

These are the steps before the error message occurs:

Estimating GSVA scores for 1 gene sets.

Computing observed enrichment scores

Estimating ECDFs in rnaseq data with Poisson kernels

Using parallel with 4 cores

ADD COMMENTlink written 3.0 years ago by cmgraef10
1

Where do you receive this messages? What does the traceback() function says? Have you searched in the support of Bioconductor (Where this question should probably be)?

ADD REPLYlink written 3.0 years ago by Lluís R.890

So this is what my traceback() function returns after calculating for about 10 minutes:

8: FUN(X[[i]], ...)

7: lapply(X, FUN, ...)

6: mclapp(gset.idx.list, ks_test_m, gene.density = rank.scores, sort.idxs = sort.sgn.idxs, mx.diff = mx.diff, tau = tau, verbose = verbose)

5: compute.geneset.es(expr, gset.idx.list, 1:n.samples, rnaseq = rnaseq, abs.ranking = abs.ranking, parallel.sz = parallel.sz, parallel.type = parallel.type, mx.diff = mx.diff, tau = tau, kernel = kernel, verbose = verbose)

4: .gsva(expr, mapped.gset.idx.list, method, rnaseq, abs.ranking, no.bootstraps, bootstrap.percent, parallel.sz, parallel.type, mx.diff, tau, kernel, ssgsea.norm, verbose)

3: .local(expr, gset.idx.list, ...)

2: gsva(gene_expression, genelist_mic, rnaseq = TRUE, abs.ranking = FALSE, min.sz = 1, max.sz = Inf, no.bootstraps = 0, bootstrap.percent = 0.632, parallel.sz = 0, parallel.type = "SOCK", mx.diff = TRUE, tau = switch(method, gsva = 1, ssgsea = 0.25, NA), kernel = TRUE, ssgsea.norm = TRUE, verbose = TRUE)

1: gsva(gene_expression, genelist_mic, rnaseq = TRUE, abs.ranking = FALSE, min.sz = 1, max.sz = Inf, no.bootstraps = 0, bootstrap.percent = 0.632, parallel.sz = 0, parallel.type = "SOCK", mx.diff = TRUE, tau = switch(method, gsva = 1, ssgsea = 0.25, NA), kernel = TRUE, ssgsea.norm = TRUE, verbose = TRUE)

Yes, I did and unfortunately I wasn't able to find anything. Thanks for the note though, if I can't solve the problem with this post here, I will try it there.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by cmgraef10

the genelist_mic is a vector, a list or a GeneSetCollection? Altough the problem might be in mclapp not exporting the right namespace to the different sockets.

ADD REPLYlink written 3.0 years ago by Lluís R.890
1

Agree with @Lluís R point on the support of Bioconductor.

Anyways, I think there is a problem in reading geneset file. Try with gmt format.

How many geneset are you using ?

ADD REPLYlink written 3.0 years ago by Mike1.5k

Why do you think it's not working with a matrix? I have 172 datasets with around 20,000 genes each. I just don't understand it because calculating the z-scores with the gsva() function and rnaseq=FALSE works perfectly. It also calculates for around 10 minutes and stops then.

So here it what I tried:

> gmt_gene_expression <- getGmt(gene_expression)
Error in readLines(con, ...) : 'con' is not a connection

I guess that is a problem because gene_expression is a matrix and getGmt expects a character vector. But how would I solve that?

ADD REPLYlink written 3.0 years ago by cmgraef10

read geneset file in getGmt, Not expression_matrix.

genelist_mic <- getGmt(geneset.gmt)

ADD REPLYlink written 3.0 years ago by Mike1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1819 users visited in the last hour