Question: Mapping symbols to Ensembl IDs using mapIds(): NA results, multiVals, and what to do with Human Alternative sequence Genes
I am trying to run fgsea on TCGA data using the gene sets from the Molecular Signatures Database. I have downloaded the .gmt symbols file and use mapIds() from AnnotationDbi to convert the symbols to the Ensembl IDs used in the TCGA data.

1: My first problem is that mapIds() occasionally returns NA for some genes, and I am not sure why, because when I search for them they do have an Ensembl ID. Is this something to do with transcript IDs? Is there a way to fix this? I can use gmtFile[gmtFile == "AC093012.1"] <- NA to temporarily get around the problem, but I know this is not best practice and I would love it if somebody has a solution.

2: My other problem is that when I test whether the Ensembl IDs are in the TCGA data, I sometimes find that they are not there. I have noticed that this usually happens when the gene is a Human Alternative sequence Gene, or at least has a Human Alternative sequence Gene entry as well as the regular Human Gene entry. Again, I can delete the gene from the gene set as it is not in the dataset, but is this okay to do?

3: My final question is that mapIds() often reports a 1:many mapping between keys and columns. I guess this is because a symbol will often have multiple Ensembl IDs. I have been using multiVals = "first" to get just the first Ensembl ID for each gene, but is this okay, or should I be extending the gene set with extra entries for the additional Ensembl IDs?

Examples of genes for which mapIds() returns NA:

  • AC093012.1: ENSG00000257896
  • HBBP1: ENSG00000229988
  • MIR1-2: ENSG00000284453
  • MIR19B1: ENSG00000284375
  • MIR19B2: ENSG00000284107
  • MIR29B1: ENSG00000284203
  • MIR29B2: ENSG00000284203
  • MIR665: ENSG00000283159
  • SHLD2P3: ENSG00000189014
  • MEIS3P1: ENSG00000179277
  • C7ORF50: ENSG00000146540
  • C1ORF109: ENSG00000116922
  • C1ORF115: ENSG00000162817
  • CXORF38: ENSG00000185753
  • CSF2RBP1: ENSG00000232254
  • C1ORF174: ENSG00000198912
  • VENTXP7: ENSG00000236380
  • RBMS1P1: ENSG00000225422
  • FAM182B: ENSG00000175170
  • RBMY2AP: ENSG00000226092
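
To make the NA cases easier to inspect, here is a minimal sketch in base R of how the unmapped symbols could be collected and retried. It uses a toy vector in place of the real mapIds() result, and fix_orf_case() is a hypothetical helper, not part of AnnotationDbi. It rests on one assumption that the post does not confirm: that the upper-case "ORF" symbols in the list above (C7ORF50, CXORF38, ...) fail only because the lookup is case-sensitive and HGNC writes them as C7orf50, CXorf38, and so on. It would not explain the other entries (pseudogenes, MIR genes, clone-based names).

```r
# Toy stand-in for a mapIds() result: names are the input symbols,
# values are Ensembl IDs, NA where no mapping was found.
mapped <- c(TP53    = "ENSG00000141510",
            C7ORF50 = NA,
            CXORF38 = NA,
            HBBP1   = NA)

# Symbols that came back as NA
unmapped <- names(mapped)[is.na(mapped)]

# Hypothetical helper: rewrite "C7ORF50" as "C7orf50", "CXORF38" as "CXorf38".
# Symbols that do not match the pattern are returned unchanged.
fix_orf_case <- function(sym) {
  sub("^(C[0-9XY]+)ORF([0-9]+)$", "\\1orf\\2", sym)
}

# Keys to try again with mapIds()
retry_keys <- fix_orf_case(unmapped)
print(retry_keys)
```

The corrected keys could then be passed back to mapIds(); whether they resolve depends on the annotation package version in use.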

Examples of genes that I can't find in the TCGA dataset:

  • HLA-DRB4: ENSG00000227357 / ENSG00000227826 / ENSG00000231021
  • HLA-DRB3: ENSG00000230463 / ENSG00000231679 / ENSG00000196101
  • C4B_2: ENSG00000233312
  • MUC2: ENSG00000198788 / ENSG00000278466 / ENSG00000284971
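
For reference, the membership test and the "delete what is not in the dataset" step described in question 2 can be sketched in base R as below. The TCGA row names and their version suffix are made up for illustration, and the sketch assumes the expression matrix may carry versioned IDs (ENSG....N), which would need stripping before comparing against unversioned IDs; the post does not say whether that is the case here.

```r
# Hypothetical row names of a TCGA expression matrix (versioned Ensembl IDs)
tcga_ids <- c("ENSG00000141510.17")

# Drop a trailing ".version" suffix so IDs compare against unversioned ones
tcga_ids_clean <- sub("\\.[0-9]+$", "", tcga_ids)

# A toy gene set: TP53 plus one HLA-DRB4 ID from the list above
genesets <- list(example_set = c("ENSG00000141510", "ENSG00000227357"))

# Keep only IDs present in the data, and record what was dropped
kept    <- lapply(genesets, intersect, tcga_ids_clean)
dropped <- lapply(genesets, setdiff, tcga_ids_clean)

print(kept)
print(dropped)
```

Keeping the dropped list makes it possible to report how many genes each set lost instead of removing them silently.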

mapIds() call:

    mapIds(x = ,
           keys = currentgeneset,
           fuzzy = TRUE,
           multiVals = "first")
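
The two options in question 3 can be compared side by side with a small base R sketch. The list below is a toy stand-in for what mapIds() returns with multiVals = "list" (one element per symbol, holding every ID found), filled with IDs from the examples in this post; which annotation object is passed as x is left open, as in the call above.

```r
# Toy stand-in for mapIds(..., multiVals = "list"):
# one element per symbol, each holding all Ensembl IDs found for it.
id_list <- list(
  "HLA-DRB3" = c("ENSG00000230463", "ENSG00000231679", "ENSG00000196101"),
  "C4B_2"    = "ENSG00000233312",
  "HBBP1"    = NA_character_
)

# Option A: first ID per symbol (the effect of multiVals = "first")
first_only <- vapply(id_list, function(ids) ids[1], character(1))

# Option B: extend the gene set to every ID, dropping NAs and duplicates
expanded <- unique(unlist(id_list, use.names = FALSE))
expanded <- expanded[!is.na(expanded)]

print(first_only)
print(expanded)
```

If IDs that are absent from the expression data are filtered out afterwards, option B keeps whichever of the alternative IDs is actually present, whereas option A depends on which ID happens to be listed first.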

I know these questions have been asked a lot on here but I can't seem to find the answers I'm after.

Thanks in advance for any help!!

