Question

GVSA: problem with gene identifier

0

Entering edit mode

7.0 years ago

cmgraef ▴ 10

Hi,

I'm currently trying to calculate a signature score for immune cells in the tumor environment using RNAseq data from the TCGA database. To do so, I'm using the TCGAbiolinks and GVSA package:

query.gbm <- GDCquery(project = "TCGA-GBM",
                 data.category = "Gene expression",
                 data.type = "Gene expression quantification",
                 platform = "Illumina HiSeq", 
                 file.type  = "normalized_results",
                 experimental.strategy = "RNA-Seq",
                 legacy = TRUE)

data <- GDCprepare(query.gbm)
gene_expression <- assay(data)

mic <- c("Epb41l5","Ysk4","Fam72a","Cdk18","Adora1","Olfml2b","Slamf1","Sccpdh","1110021L09Rik","Ank3","Kitl","Pmel","Sec14l2","Snap47","Zfp867","Fam83g","Rab34","Sarm1","Leprel4","Hsd17b1","Plekhh3","Etv4","Rundc3a","Ccdc46","Apob","Wdr35", ...)
genelist_mic <- list(mic)

score <- gsva(gene_expression, genelist_mic,
               method= "ssgsea",
               rnaseq=TRUE,
               abs.ranking=FALSE,
               min.sz=1,
               max.sz=Inf,
               no.bootstraps=0,
               bootstrap.percent = .632,
               parallel.sz=0,
               parallel.type="SOCK",
               mx.diff=TRUE,
               tau=switch(method, gsva=1, ssgsea=0.25, NA),
               kernel=TRUE,
               ssgsea.norm=TRUE,
               verbose=TRUE)

Everything seems to work just fine. I get an expression matrix with gene ids in rows and samples in the corresponding columns but I always get the following error message for the gsva() function:

Error in .local(expr, gset.idx.list, ...) : No identifiers in the gene sets could be matched to the identifiers in the expression data.

If I adjust my gene list (genelist_mic) to some of the gene identifiers from my gene expression matrix (gene_expression) that I get from TCGA, the code works. I also checked the gene identifiers of both the gene expression matrix and my cell type specific gene list with the David tool and it looks like they are all Gene Official Symbols. So I don't understand why it doesn't recognize the gene identifiers.

The gene identifiers from the gene expression matrix that I get back from TCGA look like this:

A1BG A2M NAT1 NAT2 RP11-986E7.7 SERPINA3 AADAC AAMP AANAT AARS ABAT ABCA1 ABCA2 ABCA3 ABCB7 ABCF1

Does anyone know why this might happen?

I would highly appreciate your help!

Thanks, Moritz

gvsa RNA-Seq gsea • 8.5k views

ADD COMMENT • link updated 3.2 years ago by 16622080827 • 0 • written 7.0 years ago by cmgraef ▴ 10

0

Entering edit mode

Hi Mike,

thanks for your fast reply! So I downloaded the data, checked again and noticed that the gene identifiers in the gene expression matrix were capital letters as opposed to my genelist. So I converted my genelist to all capital letters but now I get the following error message, which I don't understand:

'Error in FUN(X[[i]], ...) : object 'method' not found

These are the steps before the error message occurs:

Estimating GSVA scores for 1 gene sets.

Computing observed enrichment scores

Estimating ECDFs in rnaseq data with Poisson kernels

Using parallel with 4 cores

ADD REPLY • link 7.0 years ago by cmgraef ▴ 10

1

Entering edit mode

Where do you receive this messages? What does the traceback() function says? Have you searched in the support of Bioconductor (Where this question should probably be)?

ADD REPLY • link 7.0 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

So this is what my traceback() function returns after calculating for about 10 minutes:

> 8: FUN(X[[i]], ...) 
> 
> 7: lapply(X, FUN, ...)
> 
> 6: mclapp(gset.idx.list,
> ks_test_m, gene.density = rank.scores, 
>        sort.idxs = sort.sgn.idxs, mx.diff = mx.diff, tau = tau, 
>        verbose = verbose) 
> 
> 5: compute.geneset.es(expr, gset.idx.list, 1:n.samples, rnaseq =
> rnaseq, abs.ranking = abs.ranking, parallel.sz = parallel.sz, parallel.type = parallel.type, mx.diff = mx.diff, tau = tau, 
> kernel = kernel, verbose = verbose)
>
 4: .gsva(expr, mapped.gset.idx.list, method, rnaseq,
> abs.ranking, 
>        no.bootstraps, bootstrap.percent, parallel.sz, parallel.type, 
>        mx.diff, tau, kernel, ssgsea.norm, verbose) 
> 
3: .local(expr, gset.idx.list, ...)
> 
> 2: gsva(gene_expression, genelist_mic, rnaseq =
> TRUE, abs.ranking = FALSE, 
>        min.sz = 1, max.sz = Inf, no.bootstraps = 0, bootstrap.percent = 0.632, 
>        parallel.sz = 0, parallel.type = "SOCK", mx.diff = TRUE, 
>        tau = switch(method, gsva = 1, ssgsea = 0.25, NA), kernel = TRUE, 
>        ssgsea.norm = TRUE, verbose = TRUE) 
> 
> 1: gsva(gene_expression, genelist_mic, rnaseq = TRUE, abs.ranking =
> FALSE, min.sz = 1, max.sz = Inf, no.bootstraps = 0, bootstrap.percent = 0.632, parallel.sz = 0, parallel.type = "SOCK", 
> mx.diff = TRUE, tau = switch(method, gsva = 1, ssgsea = 0.25, NA),
> kernel = TRUE, ssgsea.norm = TRUE, verbose = TRUE)

Yes, I did and unfortunately I wasn't able to find anything. Thanks for the note though, if I can't solve the problem with this post here, I will try it there.

ADD REPLY • link updated 3.7 years ago by Ram 43k • written 7.0 years ago by cmgraef ▴ 10

0

Entering edit mode

the genelist_mic is a vector, a list or a GeneSetCollection? Altough the problem might be in mclapp not exporting the right namespace to the different sockets.

ADD REPLY • link 7.0 years ago by Lluís R. ★ 1.2k

1

Entering edit mode

Agree with @Lluís R point on the support of Bioconductor.

Anyways, I think there is a problem in reading geneset file. Try with gmt format.

How many geneset are you using ?

ADD REPLY • link 7.0 years ago by Mike ★ 1.9k

0

Entering edit mode

Why do you think it's not working with a matrix? I have 172 datasets with around 20,000 genes each. I just don't understand it because calculating the z-scores with the gsva() function and rnaseq=FALSE works perfectly. It also calculates for around 10 minutes and stops then.

So here it what I tried:

> gmt_gene_expression <- getGmt(gene_expression)
Error in readLines(con, ...) : 'con' is not a connection

I guess that is a problem because gene_expression is a matrix and getGmt expects a character vector. But how would I solve that?

ADD REPLY • link 7.0 years ago by cmgraef ▴ 10

0

Entering edit mode

read geneset file in getGmt, not expression_matrix.

genelist_mic <- getGmt(geneset.gmt)

ADD REPLY • link updated 3.7 years ago by Ram 43k • written 7.0 years ago by Mike ★ 1.9k

0

Entering edit mode

Hi,

How is your problem solved? I met the same issue and I have check the row name of my expression data and the ids in the gene set"gmt" file. They are the same in format. Most puzzling，I use the same code and same data several days ago but it just didn't work when I try it yesterday and today. I would highly appreciate your help.

Clover very puzzling!

ADD REPLY • link 3.2 years ago by 16622080827 • 0

0

Entering edit mode

Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply instead as appropriate.

ADD REPLY • link 3.2 years ago by Ram 43k

score 2 · Answer 1 · 2017-04-18

I also observed same in my cases, so name of gene in gene list (genelist_mic) and your expression matrix should be same. I mean gene "Ysk4" in genelist_mic should be same as "Ysk4" in expression matrix. And one more thing that some TCGA data analysis package automatically convert official gene name, eg, MAP3K19 is for Ysk4. so please check your data matrix that this gene is same as your input (Ysk4) or it is converted into MAP3K19.