Question

Mapping mouse gene symbols to Entrez IDs in GAGE

8.5 years ago

bojingjia ▴ 10

I've come across many posts about common errors using GAGE, and many of these common pitfalls relate to mismatching ID systems (Entrez gene ID, gene symbol, etc). I've read the "Gene set and data preparation" vignette, but still get errors when I try to convert my gene symbols to Entrez IDs.

I have two questions:

1. Is there a more effective way to map gene symbols to Entrez IDs? For example, of 38720 unique input IDs, 8850 remain unmapped. I am working with mouse data and trying to map the gene symbols in my featureCounts output.
2. What does it really mean when the xml/png files fail to download during my GAGE analysis? I get errors like:

Info: Downloading xml files for hsammu04060, 1/1 pathways..
Warning: Download of hsammu04060 xml file failed!
This pathway may not exist!

Thanks in advance

RNA-Seq DESeq2 GSEA GAGE pathview • 5.9k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.5 years ago by bojingjia ▴ 10

## Load required libraries
library("DESeq2")
library("gage")
library("pathview")

## Combine count files into dataframe
# Import data from featureCounts
countdata <- read.table("wt_CEvsRT.txt", header=TRUE, row.names=1)

# Convert to matrix
countdata <- as.matrix(countdata)
head(countdata)

# Assign condition
sampleCondition <- c("RT", "RT", "RT", "CE", "CE", "CE")

# Analysis with DESeq2 ----------------------------------------------------
# Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix
(coldata <- data.frame(row.names=colnames(countdata), sampleCondition))
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~sampleCondition)

## Run DESeq normalization
dds<-DESeq(dds)

##from GAGE

deseq2.res <- results(dds)
deseq2.fc=deseq2.res$log2FoldChange
names(deseq2.fc)=rownames(deseq2.res)
exp.fc=deseq2.fc
out.suffix="deseq2"

require(gage)
data(kegg.gs)

#get the annotation files for mouse

kg.mouse<- kegg.gsets("mouse")
kegg.gs<- kg.mouse$kg.sets[kg.mouse$sigmet.idx]

#convert gene symbol to entrez ID

gene.symbol.eg<- id2eg(ids=names(exp.fc), category='SYMBOL', org='Mm')

names(exp.fc)<- gene.symbol.eg[,2]

fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
sel <- fc.kegg.p$greater[, "q.val"] < 0.2 & !is.na(fc.kegg.p$greater[, "q.val"])
path.ids <- rownames(fc.kegg.p$greater)[sel]
sel.l <- fc.kegg.p$less[, "q.val"] < 0.2 & !is.na(fc.kegg.p$less[,"q.val"])
path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
require(pathview)
#view first 3 pathways as demo
pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data = exp.fc, pathway.id = pid,species = "hsa", out.suffix=out.suffix))

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.5 years ago by bojingjia ▴ 10

I don't know if it is the cause of all your problems, but you should be using species = "mmu" on your pathview() call.

ADD REPLY • link 8.5 years ago by h.mon 35k

Thanks! That solved the errors. I am still unable to completely map all the gene symbols, do you have any suggestions?

ADD REPLY • link 8.5 years ago by bojingjia ▴ 10

No, I do not have any (easy) suggestions. In fact, the situation is probably worse if you use org.Mm.eg.db and run:

gene.symbol.eg <- select(org.Mm.eg.db,keys=names(exp.fc),columns="ENTREZID", keytype="SYMBOL")

You will probably get a "1:many mapping" message, indicating that some gene symbols map to multiple Entrez IDs. See here and here for discussions and suggestions.
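One possible way to deal with the 1:many case and with unmapped symbols is sketched below. It assumes that keeping the first Entrez ID per symbol is acceptable and that, when several entries end up under the same Entrez ID, the one with the largest absolute fold change is kept. The helper name `clean_fc` and the toy values are hypothetical; `mapIds()` with `multiVals = "first"` is the AnnotationDbi call that returns a single ID per key.

```r
# Hypothetical helper: tidy a named fold-change vector before passing it to gage().
# Drops entries whose name is NA (unmapped symbols) or whose value is NA,
# then keeps one value per Entrez ID (the largest absolute fold change).
clean_fc <- function(fc) {
  keep <- !is.na(names(fc)) & !is.na(fc)
  fc <- fc[keep]
  fc <- fc[order(abs(fc), decreasing = TRUE)]
  fc[!duplicated(names(fc))]
}

# Toy example (made-up values), standing in for exp.fc after renaming
toy.fc <- c(1.2, -0.4, 2.5, -3.1, 0.7)
names(toy.fc) <- c("11287", NA, "11287", "11298", NA)
toy.clean <- clean_fc(toy.fc)
print(toy.clean)

# With the real data, assuming org.Mm.eg.db is installed:
# library(org.Mm.eg.db)
# eg <- mapIds(org.Mm.eg.db, keys = names(exp.fc), column = "ENTREZID",
#              keytype = "SYMBOL", multiVals = "first")
# names(exp.fc) <- unname(eg)
# exp.fc <- clean_fc(exp.fc)
```

Whether "first ID" and "largest absolute fold change" are the right tie-breakers depends on the analysis; they are used here only to keep the sketch short.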

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.5 years ago by h.mon 35k

score 0 · Answer 1 · 2015-11-20

id2eg uses the comprehensive gene annotation packages in Bioconductor. Almost all (if not all) official gene symbols can be mapped to Entrez Gene IDs this way. You should check that the unmapped gene symbols are “official”, as they might be synonyms, other types of gene IDs, or even transcript IDs. That said, ~30000 genes in your data are mapped. Pathway analysis with those should still be very informative.
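A quick way to act on that advice is to look at what the unmapped IDs actually look like. The sketch below is plain base R; the helper name `guess_id_type`, the regular expressions, and the example IDs are illustrative assumptions and not part of gage.

```r
# Hypothetical helper: guess what kind of identifier each unmapped ID is.
guess_id_type <- function(ids) {
  type <- rep("symbol-like", length(ids))
  type[grepl("^ENSMUSG[0-9]+", ids)] <- "Ensembl gene"
  type[grepl("^ENSMUST[0-9]+", ids)] <- "Ensembl transcript"
  type[grepl("^(NM|NR|XM|XR)_[0-9]+", ids)] <- "RefSeq transcript"
  type[grepl("^[0-9]+$", ids)] <- "numeric (maybe Entrez)"
  type[grepl("^Gm[0-9]+$|Rik$", ids)] <- "predicted/RIKEN gene"
  type
}

# Made-up examples of the kinds of IDs that can appear in a count table
example.ids <- c("Trp53", "ENSMUSG00000059552", "NM_011640", "Gm12345",
                 "1700001C02Rik", "22059")
id.types <- guess_id_type(example.ids)
print(table(id.types))

# With the real data, the unmapped symbols are the rows where id2eg() returned NA:
# unmapped <- gene.symbol.eg[is.na(gene.symbol.eg[, 2]), 1]
# table(guess_id_type(unmapped))
# Synonyms can then be retried against the ALIAS keytype,
# assuming org.Mm.eg.db is installed:
# library(org.Mm.eg.db)
# alias.eg <- mapIds(org.Mm.eg.db, keys = unmapped, column = "ENTREZID",
#                    keytype = "ALIAS", multiVals = "first")
```

Aliases can be ambiguous (one alias may point to several genes), so any IDs rescued this way are worth a spot check.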

BTW, for your error message, species = "mmu" is the solution. When species is not set, the default (hsa, i.e. human) is used. Hence you get odd pathway names like hsammu04060 and, of course, nothing can be downloaded for these “pathways”.
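As a minimal sketch of that fix, the pathway IDs can be reduced to their 5-digit KEGG numbers and the organism supplied only through the species argument, so the two cannot disagree. The helper name `kegg_number` and the example row names are hypothetical; the row names imitate the "mmu04060 Pathway name" form of the gage result tables.

```r
# Hypothetical helper: pull the 5-digit KEGG pathway number out of a gage row name
# such as "mmu04060 Cytokine-cytokine receptor interaction".
kegg_number <- function(x) sub("^[a-z]{3,4}([0-9]{5}).*$", "\\1", x)

example.rows <- c("mmu04060 Cytokine-cytokine receptor interaction",
                  "mmu04110 Cell cycle")
pids <- kegg_number(example.rows)
print(paste0("mmu", pids))

# With the real data (pathview downloads files from KEGG, so this needs network access):
# pv.out.list <- sapply(kegg_number(path.ids)[1:3], function(pid)
#   pathview(gene.data = exp.fc, pathway.id = pid, species = "mmu",
#            out.suffix = out.suffix))
```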