I am trying to run fgsea on TCGA data using the gene sets from the Molecular Signatures Database (MSigDB).
I have downloaded the .gmt symbols file and use mapIds() from AnnotationDbi to convert the symbols to the Ensembl IDs that I have in the TCGA data.
1: My first problem is that mapIds() occasionally returns NA for some genes, and I am not sure why, because when I search for them on ensembl.org they do have an Ensembl ID. Does this have something to do with transcript IDs? Is there a way to fix this?
I can use gmtFile[gmtFile == "AC093012.1"] <- NA to get around the problem temporarily, but I know this is not best practice, and I would love to hear a proper solution.
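To make the workaround concrete, here is a minimal sketch of handling every unmapped symbol at once instead of hard-coding each one. The vector `mapped` is a toy stand-in for the named character vector that mapIds() returns, and the names `unmapped` and `mapped_ok` are only illustrative.

```r
# Toy stand-in for the output of mapIds(): names are the input symbols,
# values are Ensembl IDs, and NA marks a symbol that could not be mapped.
mapped <- c(
  TP53         = "ENSG00000141510",
  "AC093012.1" = NA,
  HBBP1        = NA,
  BRCA1        = "ENSG00000012048"
)

# Record which symbols failed so that nothing is dropped silently.
unmapped <- names(mapped)[is.na(mapped)]

# Keep only the symbols that received an Ensembl ID.
mapped_ok <- mapped[!is.na(mapped)]

print(unmapped)
print(mapped_ok)
```

This still only hides the problem, which is why I would like to understand where the NAs come from.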
2: My other problem is that when I test whether the Ensembl IDs are in the TCGA data, I sometimes find that they are missing. This usually seems to happen when the gene is annotated as a "Human Alternative sequence Gene", or has an alternative-sequence entry in addition to the regular "Human Gene" entry. Again, I can delete the gene from the gene set since it is not in the dataset, but is this okay to do?
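For clarity, this is roughly what I mean by deleting genes that are not in the dataset. It is only a sketch on toy data: `gene_sets` and `tcga_ids` are hypothetical stand-ins for my converted gene sets and the row names of the TCGA matrix, and the version-stripping step assumes the TCGA IDs carry a suffix such as ".16", which may not apply to every download.

```r
# Hypothetical gene sets, already converted to Ensembl IDs.
gene_sets <- list(
  set_a = c("ENSG00000141510", "ENSG00000012048", "ENSG00000233312"),
  set_b = c("ENSG00000198788", "ENSG00000284971")
)

# Hypothetical row names of the TCGA expression matrix.
# Assumption: some IDs carry a version suffix that must be removed first.
tcga_ids <- c("ENSG00000141510.16", "ENSG00000012048.22", "ENSG00000198788")
tcga_ids <- sub("\\.[0-9]+$", "", tcga_ids)

# Keep only the IDs in each set that are actually measured.
filtered_sets <- lapply(gene_sets, intersect, tcga_ids)

# Count how many IDs were removed from each set.
n_removed <- lengths(gene_sets) - lengths(filtered_sets)

print(filtered_sets)
print(n_removed)
```

Keeping `n_removed` around at least shows how much of each gene set is lost to the filter.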
3: My final question is that mapIds() often reports a 1:many mapping between keys and columns. I guess this is because a symbol can have multiple Ensembl IDs. I have been using multiVals = "first" to take just the first Ensembl ID for each gene, but is this okay, or should I extend the gene set with an extra entry for each additional Ensembl ID?
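To illustrate the two options, here is a minimal sketch on toy data. The list `symbol_to_ensembl` is a hand-written stand-in for what mapIds() returns with multiVals = "list" (one character vector of Ensembl IDs per symbol), and the names `first_only` and `expanded` are only illustrative.

```r
# Toy stand-in for mapIds(..., multiVals = "list"):
# each symbol maps to one or more Ensembl IDs.
symbol_to_ensembl <- list(
  "HLA-DRB4" = c("ENSG00000227357", "ENSG00000227826", "ENSG00000231021"),
  MUC2       = c("ENSG00000198788", "ENSG00000278466", "ENSG00000284971"),
  TP53       = "ENSG00000141510"
)

gene_set <- c("HLA-DRB4", "MUC2", "TP53")

# Option A: keep only the first ID per symbol (what multiVals = "first" does).
first_only <- vapply(
  symbol_to_ensembl[gene_set],
  function(ids) ids[1],
  character(1)
)

# Option B: extend the gene set with every ID of every symbol.
expanded <- unique(unlist(symbol_to_ensembl[gene_set], use.names = FALSE))

print(first_only)
print(expanded)
```

With option B the gene set grows from 3 to 7 IDs here, and I am not sure whether that is appropriate for fgsea.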
Examples of genes for which mapIds() returns NA:
- AC093012.1: ENSG00000257896
- HBBP1: ENSG00000229988
- MIR1-2: ENSG00000284453
- MIR19B1: ENSG00000284375
- MIR19B2: ENSG00000284107
- MIR29B1: ENSG00000284203
- MIR29B2: ENSG00000284203
- MIR665: ENSG00000283159
- SHLD2P3: ENSG00000189014
- MEIS3P1: ENSG00000179277
- C7ORF50: ENSG00000146540
- C1ORF109: ENSG00000116922
- C1ORF115: ENSG00000162817
- CXORF38: ENSG00000185753
- CSF2RBP1: ENSG00000232254
- C1ORF174: ENSG00000198912
- VENTXP7: ENSG00000236380
- RBMS1P1: ENSG00000225422
- FAM182B: ENSG00000175170
- RBMY2AP: ENSG00000226092
Examples of genes that I can't find in the TCGA dataset:
- HLA-DRB4: ENSG00000227357 / ENSG00000227826 / ENSG00000231021
- HLA-DRB3: ENSG00000230463 / ENSG00000231679 / ENSG00000196101
- C4B_2: ENSG00000233312
- MUC2: ENSG00000198788 / ENSG00000278466 / ENSG00000284971
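This is the call I am using:

    library(org.Hs.eg.db)
    library(AnnotationDbi)

    # Convert gene symbols to Ensembl gene IDs, keeping the first match per symbol
    mapIds(
      x = org.Hs.eg.db,
      keys = currentgeneset,
      column = "ENSEMBL",
      keytype = "SYMBOL",
      fuzzy = TRUE,
      multiVals = "first"
    )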
I know these questions have been asked a lot on here but I can't seem to find the answers I'm after.
Thanks in advance for any help!!