AnnotationHub::mapIds() cannot find existing ENSG (GEO supplemental data cross-referenced with ensembl.org)
3.0 years ago
mk ▴ 300

Anyone know why I'm not getting ENSG ids for some of these symbols?

The example below returns NA for multiple symbols, including AAED1, whose Ensembl gene ID is ENSG00000158122.

> library(AnnotationHub)
> library(org.Hs.eg.db)
> library(GEOquery)
> temp <- tempfile()
> download.file(getGEO("GSM4430459")@header$supplementary_file_1,temp)
> genes <- read.csv(temp)$X
> unlink(temp)
> ensids = mapIds(org.Hs.eg.db,
                keys=genes, 
                column="ENSEMBL",
                keytype="SYMBOL",
                multiVals="first")
> ensids["AAED1"] # here is one of the <NA>
AAED1 
NA
ensg annotationhub mapping • 1.6k views
3.0 years ago

Hi, a quick check on NCBI Gene shows that the official symbol for this gene is PRXL2C, not AAED1. I would therefore not expect org.Hs.eg.db (which uses 'recent' annotation) to accept AAED1 as a SYMBOL key. However, EnsDb.Hsapiens.v86 (an older Ensembl release) does have it. So the symbol must have changed in more recent annotation versions. It is important to remember that gene annotation is constantly changing.

org.Hs.eg.db

library(org.Hs.eg.db)
select(org.Hs.eg.db,
  keys = 'AAED1',
  columns = c('ENSEMBL', 'SYMBOL'),
  keytype = 'SYMBOL')

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.

EnsDb.Hsapiens.v86

library(EnsDb.Hsapiens.v86)
select(EnsDb.Hsapiens.v86,
  keys = 'AAED1',
  columns = c('GENEID', 'SYMBOL'),
  keytype = 'SYMBOL')

           GENEID SYMBOL
1 ENSG00000158122  AAED1

------------

If we instead check for the official symbol, PRXL2C, in org.Hs.eg.db:

select(org.Hs.eg.db,
  keys = 'PRXL2C',
  columns = c('ENSEMBL', 'SYMBOL'),
  keytype = 'SYMBOL')

  SYMBOL         ENSEMBL
1 PRXL2C ENSG00000158122

----------

In situations like this, one can use limma's alias2SymbolTable() to map aliases and outdated symbols to the current official gene symbols.

limma::alias2SymbolTable('AAED1', species = 'Hs')
[1] "PRXL2C"
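
To apply this to a whole vector of symbols, such as the genes vector in the question, the two steps can be chained. What follows is only a sketch: symbols_to_ensembl() is a hypothetical helper name, not part of limma or AnnotationDbi, and it assumes the limma and org.Hs.eg.db packages are installed. The results depend on the annotation version that is installed.

```r
library(org.Hs.eg.db)  # also attaches AnnotationDbi, which provides mapIds()
library(limma)

# Hypothetical helper: map possibly outdated gene symbols to Ensembl gene IDs.
symbols_to_ensembl <- function(symbols) {
  symbols <- as.character(symbols)

  # Step 1: translate aliases and retired symbols to current official symbols.
  # alias2SymbolTable() returns NA where no official symbol is found.
  official <- limma::alias2SymbolTable(symbols, species = "Hs")

  # Start with NA for every input symbol, named by the original symbol.
  ens <- setNames(rep(NA_character_, length(symbols)), symbols)

  # Step 2: look up Ensembl IDs only for the symbols that were resolved,
  # because mapIds() stops with an error if none of its keys are valid.
  ok <- !is.na(official)
  if (any(ok)) {
    ens[ok] <- mapIds(org.Hs.eg.db,
                      keys = official[ok],
                      column = "ENSEMBL",
                      keytype = "SYMBOL",
                      multiVals = "first")
  }
  ens
}

# Example call: an outdated symbol, its official symbol, and a made-up name.
symbols_to_ensembl(c("AAED1", "PRXL2C", "not_a_gene_xyz"))
```

The result keeps the original symbols as names, so it can be matched back to the rows of the supplementary table. Symbols that are neither official nor a known alias stay NA and need to be checked by hand.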

This simple example also highlights why it's better to use Ensembl or Entrez gene IDs for analyses.

Kevin


Thanks, Kevin! That worked perfectly. As always, your clear and thoughtful answer is much appreciated.

Yeah, I definitely think it's easier to work in unique identifiers and usually only convert to symbol for reporting. Then I happen to need this published data and figure I'll just use a GEO supplementary file to "save time"... Ends up taking 10x longer than just re-quantifying their raw SRA data in Salmon...


The alias2SymbolTable() approach you mentioned would also be useful in the following situation:

  • the GEO data set is human transcriptomic sequencing and access to the SRA is controlled
  • the authors have posted the quantitated data in their supplement (public access)
  • like the above case, the symbols correspond to an unknown annotation version
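
When the annotation version behind a supplementary table is unknown, a rough first check is to count how many of its symbols are current official symbols, known aliases, or unknown to the installed org.Hs.eg.db. This is a sketch under assumptions: classify_symbols() is a hypothetical helper name, and the counts depend on the org.Hs.eg.db version installed.

```r
library(org.Hs.eg.db)  # also attaches AnnotationDbi, which provides keys()

# Hypothetical helper: tabulate symbols as official, alias, or unknown.
classify_symbols <- function(symbols) {
  symbols <- unique(as.character(symbols))

  official <- keys(org.Hs.eg.db, keytype = "SYMBOL")
  aliases  <- keys(org.Hs.eg.db, keytype = "ALIAS")

  # Check official symbols first, since the ALIAS table also lists them.
  status <- ifelse(symbols %in% official, "official",
            ifelse(symbols %in% aliases, "alias", "unknown"))

  table(factor(status, levels = c("official", "alias", "unknown")))
}

# Example call with the two symbols from this thread and a made-up name.
classify_symbols(c("AAED1", "PRXL2C", "not_a_gene_xyz"))
```

A large alias or unknown count suggests the table was built against an older annotation, in which case mapping through alias2SymbolTable() before converting to Ensembl IDs is worth the extra step.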