Ensembl gene ID with Characters
4 months ago
Jakpa ▴ 50

Hi,

I have a df of gene expression that looks like this:

I want to map the Ensembl IDs to gene names/symbols using org.Hs.eg.db with this code:

res_df$symbol = mapIds(org.Hs.eg.db, keys = rownames(res_df), keytype = "ENSEMBL", column = "SYMBOL")

I got this error:

Error in .testForValidKeys(x, keys, keytype, fks): None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments

I saw a similar post that relates this error to the decimal at the end of the Ensembl ID, and I tried fixing it with:

res_df=gsub("\\..*","",row.names(res_df))

It did not give the required output. Then I realized that the Ensembl ID column does not have a name. I tried to name it like this:

names(res_df)[0] <- "EsemblId"

but the output remained the same. I have more than 50,000 rows. How do I write code in R to remove the decimal and the numbers after it from the Ensembl IDs? I think that if I am able to do that, my first code will work, based on a previous post that I read.

Regards,

Esembl annotation GeneExpression R • 439 views

4 months ago

rownames(res_df) <- gsub("\\.[0-9]+$", "", rownames(res_df))


Or if you prefer the tidyverse

library("stringr")

rownames(res_df) <- str_remove(rownames(res_df), "\\.[0-9]+$")
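A minimal base-R sketch of the same idea, run on a toy data frame (the version suffixes and values below are made up for illustration). It also shows why assigning the gsub() result to res_df itself, as in the question, loses the data: the result is a plain character vector, so it has to be assigned to rownames(res_df) instead.

```r
# Toy stand-in for res_df; the version suffixes and values are invented.
res_df <- data.frame(
  log2FoldChange = c(1.2, -0.8, 0.3),
  pvalue = c(0.01, 0.20, 0.04),
  row.names = c("ENSG00000000003.15", "ENSG00000000005.6", "ENSG00000000419.14")
)

# Strip a trailing ".<digits>" version suffix from each row name.
# Assigning to rownames(res_df) keeps the data frame intact, whereas
# res_df <- gsub(...) would replace it with a character vector.
rownames(res_df) <- sub("\\.[0-9]+$", "", rownames(res_df))

print(rownames(res_df))
print(is.data.frame(res_df))
```

With 50,000 rows this is still a single vectorised call, so no loop is needed.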

rpolicastro, thank you for your response. Your code seems to work, but the result does not seem to last. A few minutes after getting the output that I want, if I run it again, I get this:

res= mapIds(org.Hs.eg.db, keys = rownames(res),
keytype = "ENSEMBL", column = "SYMBOL",
multiVals = "first")


'select()' returned 1:many mapping between keys and columns

then, this output:

ENSG00000000003 'TSPAN6'
ENSG00000000005 'TNMD'
ENSG00000000419 'DPM1'
ENSG00000000457 'SCYL3'
ENSG00000000460 'C1orf112'
ENSG00000000938 'FGR'
ENSG00000000971 'CFH'
ENSG00000001036 'FUCA2'
ENSG00000001084 'GCLC'
ENSG00000001167 'NFYA'
ENSG00000001460 'STPG1'
ENSG00000001461 'NIPAL3'
ENSG00000001497 'LAS1L'
ENSG00000001561 'ENPP4'
ENSG00000001617 'SEMA3F'
ENSG00000001626 'CFTR'
ENSG00000001629 'ANKIB1'

instead of a gene symbol column alongside the other variables such as pvalue, log2FoldChange, etc. Also, the majority of the gene symbols are NA.

Please, any idea on how to resolve this?

Most genes as annotated by Ensembl do not have gene symbols, so when you fetch them, the NAs effectively mean "this gene does not have a name".
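As for getting a named vector instead of a column: in the call above, the result of mapIds() is assigned to res itself, which would replace the results table with that vector and may be why a second run fails. A minimal sketch of storing the symbols in a new column instead is below. To keep it runnable without Bioconductor, the symbols vector is a hypothetical stand-in for what mapIds() returns, and the third ID, the values, and the label column are made up for illustration.

```r
# Toy results table; row names are unversioned Ensembl IDs.
# The third ID is invented to represent a gene without a symbol.
res <- data.frame(
  log2FoldChange = c(1.2, -0.8, 0.3),
  pvalue = c(0.01, 0.20, 0.04),
  row.names = c("ENSG00000000003", "ENSG00000000005", "ENSG00000999999")
)

# Hypothetical stand-in for the named character vector mapIds() returns
# (names are the keys, values are symbols, NA where nothing maps).
# With the real package this would be roughly:
#   symbols <- mapIds(org.Hs.eg.db, keys = rownames(res),
#                     keytype = "ENSEMBL", column = "SYMBOL",
#                     multiVals = "first")
symbols <- c(ENSG00000000003 = "TSPAN6",
             ENSG00000000005 = "TNMD",
             ENSG00000999999 = NA)

# Store the symbols in a new column; res itself is not overwritten,
# so the same lines can be re-run safely.
res$symbol <- unname(symbols[rownames(res)])

# Optional: fall back to the Ensembl ID where no symbol is available.
res$label <- ifelse(is.na(res$symbol), rownames(res), res$symbol)

print(res)
```

Keeping the NA values in symbol and adding a separate label column is only one option; rows without a symbol could equally be left as NA or filtered out, depending on the downstream analysis.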