Converting HGNC symbols to Ensembl and Entrez IDs using biomaRt
22 months ago

I have a vector of gene IDs:

head(data)
[1] "Ank2"   "Scg2"   "Nefh"   "Sgip1"  "Amph"   "Srcin1"


I used this:

require(biomaRt)
mart=useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
mapping <- getBM(attributes=c("hgnc_symbol","ensembl_gene_id","entrezgene_id"), filters = "hgnc_symbol", mart=mart, values=data, uniqueRows=TRUE, bmHeader = T)
Cache found

mapping
[1] HGNC symbol                        Gene stable ID
[3] NCBI gene (formerly Entrezgene) ID
<0 rows> (or 0-length row.names)


Why does it say "Cache found"? What does it mean?

22 months ago

Hey,

Cache relates to this parameter of getBM():

useCache: Boolean indicating whether the results cache should be used. Setting to 'FALSE' will disable reading and writing of the cache. This argument is likely to disappear after the cache functionality has been tested more thoroughly.

It's basically data that is stored on your local drive from when you previously ran biomaRt. It's good practice to restart your R session for every new analysis, to clear memory and avoid re-using old variables that lurk in your workspace. Note that the biomaRt cache lives on disk, so it survives a restart; biomartCacheClear() removes it.
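
To make the cache behaviour concrete, here is a minimal sketch of inspecting and bypassing it. biomartCacheInfo(), biomartCacheClear() and the useCache argument belong to biomaRt; the helper name query_without_cache is hypothetical and only wraps the same getBM() call used in this thread.

```r
library(biomaRt)

# Report where the on-disk cache lives and how much it currently holds
biomartCacheInfo()

# Hypothetical helper: the same query as in this thread, but with the
# cache switched off, so no "Cache found" message and no cached result
query_without_cache <- function(mart, symbols) {
  getBM(
    attributes = c("mgi_symbol", "ensembl_gene_id", "entrezgene_id"),
    filters = "mgi_symbol",
    values = symbols,
    mart = mart,
    useCache = FALSE
  )
}

# To delete everything cached on disk, run:
# biomartCacheClear()
```

Nothing is sent to Ensembl until query_without_cache() is actually called with a mart object and a vector of symbols.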

The problem in this case is that you have mouse gene symbols but are querying them as HGNC symbols. HGNC is specific to Homo sapiens (human... us), so you will want MGI (mgi_symbol) instead:

require(biomaRt)

mart <- useMart('ENSEMBL_MART_ENSEMBL', host = 'useast.ensembl.org')
mart <- useDataset('mmusculus_gene_ensembl', mart)

data <- c('Ank2','Scg2','Nefh','Sgip1','Amph','Srcin1')

mapping <- getBM(
  attributes = c('mgi_symbol', 'ensembl_gene_id', 'entrezgene_id'),
  filters = 'mgi_symbol',
  mart = mart,
  values = data,
  uniqueRows = TRUE,
  bmHeader = TRUE)

mapping

  MGI symbol     Gene stable ID NCBI gene (formerly Entrezgene) ID
1       Amph ENSMUSG00000021314                             218038
2       Ank2 ENSMUSG00000032826                             109676
3       Nefh ENSMUSG00000020396                             380684
4       Scg2 ENSMUSG00000050711                              20254
5      Sgip1 ENSMUSG00000028524                              73094
6     Srcin1 ENSMUSG00000038453                              56013


Kevin


Thanks, Kevin, for pointing out the species error. It works fine now. My input file has 7289 genes, including some duplicates. After conversion, getBM removed the duplicate IDs and returned 4731 IDs. I do not want it to drop the duplicates, because I will be merging the output with my original dataset for downstream analysis. Is there any way around that with getBM?


Did you try uniqueRows = FALSE? Generally, with biomaRt, extra work is required after you perform the initial mapping. You will note that biomaRt does not even return the genes in the same order in which they were submitted.
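
One way to do that extra work is to look up each input symbol in the mapping table with base R's match(), which keeps duplicates and the original order. This is only a sketch: the mapping table below is typed in by hand from the six rows shown earlier in the thread, it uses the default column names (bmHeader = FALSE), and the input vector with a repeated and an unknown symbol is made up for illustration.

```r
# Mapping table as getBM() would return it (values copied from this thread)
mapping <- data.frame(
  mgi_symbol = c("Amph", "Ank2", "Nefh", "Scg2", "Sgip1", "Srcin1"),
  ensembl_gene_id = c("ENSMUSG00000021314", "ENSMUSG00000032826",
                      "ENSMUSG00000020396", "ENSMUSG00000050711",
                      "ENSMUSG00000028524", "ENSMUSG00000038453"),
  entrezgene_id = c(218038, 109676, 380684, 20254, 73094, 56013),
  stringsAsFactors = FALSE
)

# Hypothetical input: one duplicated symbol and one with no mapping
genes <- c("Ank2", "Scg2", "Ank2", "NotAGene", "Amph")

# Position of each input symbol in the mapping table (NA if absent)
idx <- match(genes, mapping$mgi_symbol)

# One output row per input gene, in the original order
annotated <- data.frame(
  mgi_symbol = genes,
  ensembl_gene_id = mapping$ensembl_gene_id[idx],
  entrezgene_id = mapping$entrezgene_id[idx],
  stringsAsFactors = FALSE
)
annotated
```

match() uses only the first hit, so if a symbol maps to several Ensembl or Entrez IDs, the remaining rows are silently ignored; merge() would keep them, at the cost of changing the row count and order.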

For 1-to-1 mapping, org.Mm.eg.db may be a better option. See step 3, here: https://support.bioconductor.org/p/130727/#130733
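
As a rough sketch of the org.Mm.eg.db route (not necessarily the exact code in the linked post), mapIds() from AnnotationDbi can be called once per target ID type. The input vector is made up for illustration, and the requireNamespace() guard is only there so the snippet does nothing when the annotation package is not installed.

```r
# Hypothetical input with a duplicated symbol
genes <- c("Ank2", "Scg2", "Ank2", "Nefh")

if (requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
  library(org.Mm.eg.db)

  # multiVals = "first" keeps a single ID per symbol when several exist
  ensembl <- mapIds(org.Mm.eg.db, keys = genes, column = "ENSEMBL",
                    keytype = "SYMBOL", multiVals = "first")
  entrez <- mapIds(org.Mm.eg.db, keys = genes, column = "ENTREZID",
                   keytype = "SYMBOL", multiVals = "first")

  # Assemble a table aligned with the input vector
  annotated <- data.frame(
    symbol = genes,
    ensembl_gene_id = unname(ensembl),
    entrezgene_id = unname(entrez),
    stringsAsFactors = FALSE
  )
  print(annotated)
}
```

Symbols that cannot be mapped should come back as NA, so it is worth checking for missing values before merging the result into the original dataset.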


uniqueRows = FALSE doesn't do it either. But yes, the AnnotationDbi package gives the output the way I want it. Thank you.