Question: Converting Affymetrix Probeset Ids To Symbols Or Ensembl Ids
user1409015 wrote, 6.7 years ago:

I have 22268 Affymetrix Probeset IDs as the rownames for my expression matrix. I want to map these to the official HUGO gene symbols. However, when I use the hgu133plus2.db and annotate packages to do this with the call

symbols <- getSYMBOL(as.character(expression.matrix[,1]), "hgu133plus2")
rownames(expression.matrix) <- as.character(symbols)

I get the following error

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : duplicate 'row.names' are not allowed
In addition: Warning message: non-unique values when setting 'row.names': ‘AAGAB’, ‘AAK1’, ‘AASDHPPT’, ‘AASS’, ‘ABAT’, ‘ABCA1’, ‘ABCA2’, ‘ABCB11’, ‘ABCB6’, ‘ABCB9’, ‘ABCC1’, ‘ABCC10’, ‘ABCC3’, ‘ABCC6’, ‘ABCC8’, ‘ABCC9’, ‘ABCD1’, ‘ABCD4’, ‘ABCE1’, ‘ABCF2’, ‘ABCG1’, ‘ABHD2’, ‘ABHD5’, ‘ABHD6’, ‘ABI1’, ‘ABI2’, ‘ABLIM1’, ‘ABO’, ‘ABR’, ‘ACAA1’, ‘ACAA2’, ‘ACACA’, ‘ACACB’, ‘ACADL’, ‘ACAN’, ‘ACAP1’, ‘ACAP2’, ‘ACBD3’, ‘ACE2’, ‘ACHE’, ‘ACLY’, ‘ACO2’, ‘ACOT11’, ‘ACOT7’, ‘ACOX1’, ‘ACOX3’, ‘ACP1’, ‘ACRV1’, ‘ACSBG1’, ‘ACSL1’, ‘ACSL3’, ‘ACSL6’, ‘ACSM3’, ‘ACSM5’, ‘ACTA2’, ‘ACTB’, ‘ACTG1’, ‘ACTL6B’, ‘ACTN1’, ‘ACTN2’, ‘ACTR1A’, ‘ACTR2’, ‘ACTR3’, ‘ACTR5’, ‘ACVR1B’, ‘ADA’, ‘ADAM10’, ‘ADAM12’, ‘ADAM17’, ‘ADAM19’, ‘ADAM20’, ‘ADAM22’, ‘ADAM23’, ‘ADAM [... truncated]

I know that this is because 1893 of the probesets are NA for the official HUGO gene symbol. Therefore, I want to know what the norm is for dealing with these genes: are they excluded or should I just retain the probeset name? Or should I use the Ensembl IDs? How can I do the latter? Please bear in mind that I am using an expression matrix and not the ExpressionSet object provided by Bioconductor. This is of necessity since what I am scripting needs to be understandable by a competent programmer who is not familiar with R and will certainly not be familiar with Bioconductor.

Also, should I convert the rownames of the expression matrix into factors, or is it OK to keep them as a character vector?


My recollection is that 133 did not all map stringently to protein gene IDs anyway. It's over- or double-counting by ~2K for starters.

(comment by cdsouthan, written 6.7 years ago)
Neilfws (Sydney, Australia) wrote, 6.7 years ago:

First: the statement "I know that this is because 1893 of the probesets are NA for the official HUGO gene symbol" is only part of the answer. The error message indicates that all of the symbols listed map to 2 or more probeset IDs. For example:

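library(hgu133plus2.db)  # provides hgu133plus2SYMBOL; toTable() comes from AnnotationDbi, loaded as a dependency
symbols <- toTable(hgu133plus2SYMBOL)  # data frame with columns probe_id and symbol
table(subset(symbols, symbol == "AAGAB")$symbol)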
# AAGAB 
#     3
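To list every symbol that would collide, rather than checking one at a time, something along these lines may help. It is a minimal base-R sketch on a toy table; the probe IDs and symbols (`p1`, `GENE_A`, and so on) are made up, and the real table would come from `toTable(hgu133plus2SYMBOL)`.

```r
# Toy probeset-to-symbol table; IDs and symbols are hypothetical placeholders.
mapping <- data.frame(
  probe_id = c("p1", "p2", "p3", "p4", "p5"),
  symbol   = c("GENE_A", "GENE_A", "GENE_B", "GENE_A", NA),
  stringsAsFactors = FALSE
)

# Count probesets per symbol; table() ignores NA by default.
probes_per_symbol <- table(mapping$symbol)

# Symbols that appear more than once and so cannot serve as unique row names.
multi_mapped <- names(probes_per_symbol)[probes_per_symbol > 1]
print(multi_mapped)
```

Any symbol returned here will trigger the duplicate 'row.names' error if symbols are assigned as row names of a data frame.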

Also be aware that "NA" is a HUGO gene symbol, which can be an issue if you accidentally convert NA to a character string in R.
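The accidental conversion is easy to reproduce in base R. A small sketch (the vector is illustrative only): `paste()` turns a missing value into the literal string "NA", after which `is.na()` no longer detects it.

```r
# One real symbol and one missing value.
x <- c("ACTB", NA)
print(is.na(x))   # the second element is a true missing value

# paste() coerces the missing value into the two-character string "NA".
y <- paste(x)
print(is.na(y))   # nothing is flagged as missing any more
```

Once that has happened, a missing symbol and a symbol spelled "NA" cannot be told apart, so it is safer to test with `is.na()` before any string manipulation.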

Second: there is no norm for the case where probeset IDs do not map to a gene symbol. As you suggested, you may choose to exclude them or use an alternative ID. You might be able to retrieve other IDs from hgu133plus2.db, for example:

entrez <- toTable(hgu133plus2ENTREZID) # 41920 rows
ens    <- toTable(hgu133plus2ENSEMBL)  # 43892 rows
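If you go with the "retain the probeset name" option, one possible approach is to look up each probeset with `match()` and fall back to the probeset ID where no symbol exists. This is only a sketch on made-up data; `probe_ids` and `mapping` stand in for your row names and for a table such as the ones returned by `toTable()`.

```r
# Hypothetical probeset IDs, in the same order as the expression matrix rows.
probe_ids <- c("p1", "p2", "p3")

# Hypothetical mapping table; note that "p2" has no entry.
mapping <- data.frame(
  probe_id = c("p1", "p3"),
  symbol   = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# match() returns NA for unmapped probesets and the first hit for mapped ones.
symbols <- mapping$symbol[match(probe_ids, mapping$probe_id)]

# Keep the probeset ID wherever the symbol is missing.
labels <- ifelse(is.na(symbols), probe_ids, symbols)
print(labels)
```

Because `match()` keeps only the first hit, a probeset that maps to several IDs would silently lose the others; whether that is acceptable depends on the analysis.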

Third: row names are characters by default and should be left that way. I don't know that it's even possible to coerce them into factors. You may want to consider using a column in your matrix or data frame for gene symbols, rather than using the row names.
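As a rough illustration of the column approach, the sketch below keeps the probeset ID and the symbol as ordinary character columns next to the expression values, so duplicated or missing symbols cause no row-name error. The matrix, sample names and symbols are invented for the example.

```r
# Toy expression matrix with probeset IDs as row names (values are made up).
expr <- matrix(
  c(5.1, 6.3, 7.2, 4.8, 5.9, 6.1),
  nrow = 3,
  dimnames = list(c("p1", "p2", "p3"), c("sample1", "sample2"))
)

# Hypothetical symbols in row order; one is duplicated and one is missing.
symbols <- c("GENE_A", "GENE_A", NA)

# Store IDs and symbols as plain character columns instead of row names.
annotated <- data.frame(
  probe_id = rownames(expr),
  symbol   = symbols,
  expr,
  stringsAsFactors = FALSE
)
rownames(annotated) <- NULL  # fall back to simple integer row names
print(annotated)
```

A table like this can also be written out with `write.csv()`, which may suit a reader who does not use R.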
