Question: Annotation in ExpressionSet object
gravatar for Na Sed
3.9 years ago by
Na Sed290
United States
Na Sed290 wrote:

I have a matrix of gene expression which label of each sample and symbol of each gene is known. I wanna perform gtBroad in globaltest R-package. It needs to convert expression matrix to an ExpressionSet object with known Annotation. How I can do that while I have only the name of genes?

EDIT: You can download sample data from this link:

I wanna run the below code:

broadset <- getBroadSets(asBroadUri(c('BIOCARTA_HDAC_PATHWAY', 'PID_HDAC_CLASSII_PATHWAY')))
pData <- data.frame(id=resp, row.names=colnames(data))
phenoData <- AnnotatedDataFrame(data=pData)
exprs <- ExpressionSet(data, phenoData=phenoData)
gtBroad (response=id, exprs, id='PID_HDAC_CLASSIII_PATHWAY', collection=broadset)
expressionset R • 3.5k views
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Na Sed290
gravatar for Selenocysteine
3.9 years ago by
Dublin, Ireland
Selenocysteine600 wrote:

You can use this function:

The gene annotations (featureData) are optional, so you can get an ExpressionSet without any annotations. If you want to get some information about your genes, you can use the biomaRt package which is based on the Ensembl database. Here is the guide:

Basically, you can use your list of gene symbol to query the biomaRt database and obtain a set of data for these symbols (e.g. entrezID, chromosomal location, strand). The obtained dataset (featureData) will be an object of class R dataframe. According to the biobase guide to create ExpressionSet items, you then have to convert it to an AnnotatedDataFrame using the annotatedDataFrameFrom() function. Note that the number of rows in featureData must match the number of rows in assayData, and row names of featureData (gene symbols) must match row names of the matrix / matricies in assayData (gene symbols).

ADD COMMENTlink written 3.9 years ago by Selenocysteine600

Thanks, but the problem still is remained. I added sample data as well as my code to the question. It would be great if you take a look.

ADD REPLYlink written 3.9 years ago by Na Sed290

I tried fixing your code. The main issue is that to use the GSEA tool you need to know the EntrezIDs of your genes. First, you need to map your gene symbols to their EntrezIDs, and in your case it is not possible for around 2k genes, because they are not called with the official gene name, but with an alias. You have to find the official gene names for all of your genes and change the initial dataset, I think there are some tools on like the HGNC webpage. The rest is explained in the code.

library( # I used this database because I assumed we are dealing with Homo sapiens


# Takes the gene symbols of your expression data
geneSymbols = rownames(data)

# Takes the EntrezIDs of your genes. It gives an error because some of your genes
# are not called with the official gene symbol, but with a synonym.
# You probably must fix this manually, I just removed the rows of your dataset
# for which the annotation package wasn't able to map an EntrezID.
annotations = select(, geneSymbols, "ENTREZID", "SYMBOL")

# Removes the annotations for those genes for which it didn't find the EntrezID
annotations = annotations[!$ENTREZID),]

# Creates a named list to map the EntrezIDs of your genes and their symbols
annotationsVec = c(annotations$ENTREZID)
names(annotationsVec) = annotations$SYMBOL
annotationsVec = as.list(annotationsVec)

# Removes the rows corresponding to the genes without an EntrezID
data = data[rownames(data) %in% annotations$SYMBOL,]
# [1] 5325  
# You lost around 2k genes because they are not called with the official symbols
# and the annotation cannot map their symbols to the IDs. You have to do this with some 
# other tools which allow you to look for gene symbol aliases.

broadset = getBroadSets(asBroadUri(c('BIOCARTA_HDAC_PATHWAY', 'PID_HDAC_CLASSII_PATHWAY')))
phenoData = data.frame(id = resp, row.names = colnames(data))
phenoData = AnnotatedDataFrame(data = phenoData)

exprs = ExpressionSet(data, phenoData = phenoData)

# Since you are using general annotations for Homo sapiens, add the name of the annotation database ("")
# and then the annotations list (annotationsVec) which allows it to map the rows of your ExpressionSet to their EntrezIDs.
# Note: in your sample code you wrote the pathway ID wrong, so I fixed it.
gtBroad(response = id, exprs, id = 'PID_HDAC_CLASSII_PATHWAY', annotation = "", collection = broadset, probe2entrez = annotationsVec)
ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Selenocysteine600

Looks great. Thank you so much for your help.

ADD REPLYlink written 3.8 years ago by Na Sed290
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1049 users visited in the last hour