Mapping symbols to Ensembl IDs using mapIds(): NA results, multiVals, and what to do with Human Alternative sequence Genes
3.9 years ago
jack.henry ▴ 50

I am trying to run fgsea on TCGA data using gene sets from the Molecular Signatures Database. I downloaded the gene-symbol .gmt file and use mapIds() from AnnotationDbi to convert the symbols to Ensembl IDs, which is what I have in the TCGA data.

1: My first problem is that mapIds() occasionally returns NA for some genes, and I am not sure why, because when I search for them on ensembl.org they do have an Ensembl ID. Does this have something to do with transcript IDs? Is there a way to fix it? I can use gmtFile[gmtFile == "AC093012.1"] <- NA to temporarily get around the problem, but I know this is not best practice, and I would love to hear from anybody who has a solution.

2: My other problem is that when I test whether the Ensembl IDs are in the TCGA data, I sometimes find that they are not there. I have noticed that this usually happens when the gene is a Human Alternative sequence Gene, or at least has a Human Alternative sequence Gene entry in addition to the regular Human Gene. Again, I can delete the gene from the gene set since it is not in the dataset, but is this okay to do?

3: My final question is that mapIds() often reports a 1:many mapping between keys and columns. I guess this is because a symbol will often have multiple Ensembl IDs. I have been using multiVals = "first" to get just the first Ensembl ID for each gene, but is this okay, or should I be extending the gene set to include all of the Ensembl IDs for each symbol?
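For question 3, here is a minimal base R sketch of the "extend the gene set" alternative. It assumes the mapping comes back as a named list with one character vector of Ensembl IDs per symbol, which is the shape I would expect from mapIds(..., multiVals = "list"). expand_geneset and toy_map are hypothetical names used only for illustration.

```r
# Hypothetical helper: replace each symbol in a gene set with ALL of its
# Ensembl IDs, dropping unmapped symbols (NA) and duplicated IDs.
expand_geneset <- function(symbols, symbol_to_ids) {
  ids <- unlist(symbol_to_ids[symbols], use.names = FALSE)
  unique(ids[!is.na(ids)])
}

# Toy mapping, shaped like the named list assumed above.
toy_map <- list(
  MUC2       = c("ENSG00000198788", "ENSG00000278466", "ENSG00000284971"),
  C4B_2      = "ENSG00000233312",
  AC093012.1 = NA_character_
)

expanded <- expand_geneset(c("MUC2", "C4B_2", "AC093012.1"), toy_map)
```

As far as I can tell, any extra IDs that are absent from the ranked list would simply not match anything, so the expansion should only matter for IDs that are actually in the data.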

Examples of genes for which mapIds() returns NA:

  • AC093012.1: ENSG00000257896
  • HBBP1: ENSG00000229988
  • MIR1-2: ENSG00000284453
  • MIR19B1: ENSG00000284375
  • MIR19B2: ENSG00000284107
  • MIR29B1: ENSG00000284203
  • MIR29B2: ENSG00000284203
  • MIR665: ENSG00000283159
  • SHLD2P3: ENSG00000189014
  • MEIS3P1: ENSG00000179277
  • C7ORF50: ENSG00000146540
  • C1ORF109: ENSG00000116922
  • C1ORF115: ENSG00000162817
  • CXORF38: ENSG00000185753
  • CSF2RBP1: ENSG00000232254
  • C1ORF174: ENSG00000198912
  • VENTXP7: ENSG00000236380
  • RBMS1P1: ENSG00000225422
  • FAM182B: ENSG00000175170
  • RBMY2AP: ENSG00000226092
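One thing I noticed in this list: several symbols are fully upper-cased (C7ORF50, C1ORF109, CXORF38), whereas the approved symbols use a lower-case "orf" (C7orf50, C1orf109, CXorf38). Assuming the key lookup is case-sensitive, that could explain at least those NAs. A minimal base R sketch of restoring the casing before the lookup; fix_orf_case is a hypothetical helper, not part of AnnotationDbi:

```r
# Hypothetical helper: turn "C<chromosome>ORF<number>" back into
# "C<chromosome>orf<number>"; every other symbol is returned unchanged.
fix_orf_case <- function(symbols) {
  sub("^(C[0-9XY]+)ORF([0-9]+)$", "\\1orf\\2", symbols)
}

fixed <- fix_orf_case(c("C7ORF50", "C1ORF109", "CXORF38", "HBBP1", "MIR1-2"))
```

This would not help with the pseudogene, MIR, or clone-based names in the list. For those, the ALIAS keytype in org.Hs.eg.db might be worth trying as a fallback, although I have not confirmed that it resolves them.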

Examples of genes that I can't find in the TCGA dataset:

  • HLA-DRB4: ENSG00000227357 / ENSG00000227826 / ENSG00000231021
  • HLA-DRB3: ENSG00000230463 / ENSG00000231679 / ENSG00000196101
  • C4B_2: ENSG00000233312
  • MUC2: ENSG00000198788 / ENSG00000278466 / ENSG00000284971
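This is roughly how the membership test can be done, as a base R sketch. It assumes the TCGA row names may carry a version suffix (a trailing ".<number>"), which would need stripping before comparing. strip_version and the toy vectors are hypothetical, and the version numbers are made up.

```r
# Hypothetical helper: drop a trailing ".<version>" from Ensembl gene IDs.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Toy stand-ins for the expression matrix row names and a mapped gene set.
tcga_ids    <- c("ENSG00000198788.12", "ENSG00000146540.15")  # versions made up
geneset_ids <- c("ENSG00000198788", "ENSG00000284971", "ENSG00000146540")

present <- geneset_ids %in% strip_version(tcga_ids)
kept    <- geneset_ids[present]   # IDs found in the dataset
missing <- geneset_ids[!present]  # IDs absent from the dataset
```

Keeping the missing vector around, rather than silently deleting genes, at least makes it possible to report how many members each gene set lost.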

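The mapIds() call:

library(org.Hs.eg.db)
library(AnnotationDbi)

# currentgeneset is a character vector of gene symbols from the .gmt file
mapIds(
  x = org.Hs.eg.db,
  keys = currentgeneset,
  column = "ENSEMBL",    # ID type to return
  keytype = "SYMBOL",    # ID type of the keys
  fuzzy = TRUE,
  multiVals = "first"    # keep only the first Ensembl ID per symbol
)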

I know these questions have been asked a lot on here but I can't seem to find the answers I'm after.

Thanks in advance for any help!!

RNA-Seq TCGA R ensembl gsea • 6.4k views