Question

Converting Affymetrix Probeset Ids To Symbols Or Ensembl Ids

11.1 years ago

user1409015 ▴ 20

I have 22268 Affymetrix Probeset IDs as the rownames for my expression matrix. I want to map these to the official HUGO gene symbols. However, when I use the hgu133plus2.db and annotate packages to do this with the call

symbols <- getSYMBOL(as.character(expression.matrix[,1]), "hgu133plus2")
rownames(expression.matrix) <- as.character(symbols)

I get the following error

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : duplicate 'row.names' are not allowed
In addition: Warning message: non-unique values when setting 'row.names': ‘AAGAB’, ‘AAK1’, ‘AASDHPPT’, ‘AASS’, ‘ABAT’, ‘ABCA1’, ‘ABCA2’, ‘ABCB11’, ‘ABCB6’, ‘ABCB9’, ‘ABCC1’, ‘ABCC10’, ‘ABCC3’, ‘ABCC6’, ‘ABCC8’, ‘ABCC9’, ‘ABCD1’, ‘ABCD4’, ‘ABCE1’, ‘ABCF2’, ‘ABCG1’, ‘ABHD2’, ‘ABHD5’, ‘ABHD6’, ‘ABI1’, ‘ABI2’, ‘ABLIM1’, ‘ABO’, ‘ABR’, ‘ACAA1’, ‘ACAA2’, ‘ACACA’, ‘ACACB’, ‘ACADL’, ‘ACAN’, ‘ACAP1’, ‘ACAP2’, ‘ACBD3’, ‘ACE2’, ‘ACHE’, ‘ACLY’, ‘ACO2’, ‘ACOT11’, ‘ACOT7’, ‘ACOX1’, ‘ACOX3’, ‘ACP1’, ‘ACRV1’, ‘ACSBG1’, ‘ACSL1’, ‘ACSL3’, ‘ACSL6’, ‘ACSM3’, ‘ACSM5’, ‘ACTA2’, ‘ACTB’, ‘ACTG1’, ‘ACTL6B’, ‘ACTN1’, ‘ACTN2’, ‘ACTR1A’, ‘ACTR2’, ‘ACTR3’, ‘ACTR5’, ‘ACVR1B’, ‘ADA’, ‘ADAM10’, ‘ADAM12’, ‘ADAM17’, ‘ADAM19’, ‘ADAM20’, ‘ADAM22’, ‘ADAM23’, ‘ADAM [... truncated]

I know that this is because 1893 of the probesets are NA for the official HUGO gene symbol. Therefore, I want to know what the norm is for dealing with these genes: are they excluded or should I just retain the probeset name? Or should I use the Ensembl IDs? How can I do the latter? Please bear in mind that I am using an expression matrix and not the ExpressionSet object provided by Bioconductor. This is of necessity since what I am scripting needs to be understandable by a competent programmer who is not familiar with R and will certainly not be familiar with Bioconductor.

Also, should I convert the rownames of the expression matrix into factors, or is it OK to keep them as a character vector?

microarray affymetrix r bioconductor • 9.7k views

My recollection is that 133 did not all map stringently to protein gene IDs anyway. It's over- or double-counting by ~2K for starters.

ADD REPLY • link 11.1 years ago by cdsouthan ★ 1.9k

score 2 · Answer 1 · 2013-03-17

First: your statement

"I know that this is because 1893 of the probesets are NA for the official HUGO gene symbol"

is only part of the answer. The error message indicates that every symbol it lists maps to two or more probeset IDs. For example:

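library(hgu133plus2.db)  # provides hgu133plus2SYMBOL; attaches AnnotationDbi for toTable()
# Long-format table with one row per probeset/symbol pair
symbols <- toTable(hgu133plus2SYMBOL)
# Count the probesets annotated to AAGAB
table(subset(symbols, symbol == "AAGAB")$symbol)
# AAGAB 
#     3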
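Rather than checking one symbol at a time, the same count can be tabulated for the whole mapping. This is a minimal sketch in base R; `probe2symbol` is a small made-up data frame (the probeset IDs and mappings are invented) standing in for the output of `toTable(hgu133plus2SYMBOL)`:

```r
# Toy stand-in for toTable(hgu133plus2SYMBOL); IDs and mappings are invented
probe2symbol <- data.frame(
  probe_id = c("p1_at", "p2_at", "p3_at", "p4_at", "p5_at"),
  symbol   = c("AAGAB", "AAGAB", "AAK1", "ABAT", "ABAT"),
  stringsAsFactors = FALSE
)

# Number of probesets annotated to each symbol
probes_per_symbol <- table(probe2symbol$symbol)

# Symbols that would collide if used directly as row names
multi_mapped <- names(probes_per_symbol)[probes_per_symbol > 1]
print(multi_mapped)
```

With the real annotation table, `length(multi_mapped)` would tell you how many symbols are affected.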

Also be aware that "NA" is a HUGO gene symbol, which can be an issue if you accidentally convert NA to a character string in R.
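The NA-to-string conversion is easy to trigger without noticing. A small base R illustration (the variable names are just for this example):

```r
# A genuinely missing symbol
missing_symbol <- NA_character_
is.na(missing_symbol)   # TRUE

# paste() silently turns the missing value into the two-letter string "NA"
as_text <- paste(missing_symbol)
is.na(as_text)          # FALSE: R now treats it as ordinary text
```

Once that has happened, a missing annotation can no longer be told apart from a symbol that is literally spelled "NA", so it is safer to test with `is.na()` before any string manipulation.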

Second: there is no norm for the case where probeset IDs do not map to a gene symbol. As you suggested, you may choose to exclude them or use an alternative ID. You might be able to retrieve other IDs from hgu133plus2.db, for example:

entrez <- toTable(hgu133plus2ENTREZID) # 41920 rows
ens    <- toTable(hgu133plus2ENSEMBL)  # 43892 rows
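Both options mentioned above (dropping unmapped probesets, or falling back to the probeset ID) take only a few lines of base R. The sketch below uses invented probeset IDs and a hand-written `symbols` vector in place of a real lookup, so treat it as an illustration of the pattern rather than a recommended pipeline:

```r
# Toy data: invented probeset IDs; NA marks a probeset with no symbol
probe_ids <- c("p1_at", "p2_at", "p3_at", "p4_at")
symbols   <- c("AAGAB", NA, "AAGAB", "AAK1")

# Option 1: flag the probesets that have a symbol, to drop the rest
keep <- !is.na(symbols)

# Option 2: fall back to the probeset ID where no symbol exists
labels <- ifelse(is.na(symbols), probe_ids, symbols)

# Repeated symbols still collide; make.unique() appends .1, .2, ...
unique_labels <- make.unique(labels)
print(unique_labels)
```

The same pattern should carry over to Entrez or Ensembl IDs taken from the tables above, with the caveat that one probeset can map to more than one Ensembl ID, so that table may need reducing to one row per probeset first.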

Third: row names are characters by default and should be left that way. I don't know that it's even possible to coerce them into factors. You may want to consider using a column in your matrix or data frame for gene symbols, rather than using the row names.
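Keeping the annotation in a column might look like the following. This is a sketch with a made-up 3 x 2 expression matrix and a made-up mapping table (`expr`, `probe2symbol` and all IDs are hypothetical):

```r
# Toy expression matrix with invented probeset IDs as row names
expr <- matrix(c(5.1, 6.3, 7.2, 4.8, 5.9, 6.1), nrow = 3,
               dimnames = list(c("p1_at", "p2_at", "p3_at"), c("s1", "s2")))

# Toy mapping table; p2_at deliberately has no symbol
probe2symbol <- data.frame(
  probe_id = c("p1_at", "p3_at"),
  symbol   = c("AAGAB", "AAGAB"),
  stringsAsFactors = FALSE
)

# Look up each row's symbol; unmatched probesets get NA
row_symbols <- probe2symbol$symbol[match(rownames(expr), probe2symbol$probe_id)]

# Store the IDs and symbols as ordinary columns next to the expression values
annotated <- data.frame(
  probe_id = rownames(expr),
  symbol   = row_symbols,
  expr,
  stringsAsFactors = FALSE
)
rownames(annotated) <- NULL
print(annotated)
```

Because the symbols sit in a regular column, duplicates and NAs are allowed, and the table can be written out with `write.csv()` for someone who does not use R.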