Converting HGNC to Ensembl and Entrez IDs using biomaRt
3.7 years ago

I have a vector of gene IDs:

head(data)
[1] "Ank2"   "Scg2"   "Nefh"   "Sgip1"  "Amph"   "Srcin1"

I used this:

require(biomaRt)
mart=useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
mapping <- getBM(attributes=c("hgnc_symbol","ensembl_gene_id","entrezgene_id"), filters = "hgnc_symbol", mart=mart, values=data, uniqueRows=TRUE, bmHeader = T)
Cache found

 mapping
[1] HGNC symbol                        Gene stable ID                    
[3] NCBI gene (formerly Entrezgene) ID
<0 rows> (or 0-length row.names)

Why does it say "Cache found"? What does it mean?

R biomart • 4.0k views
3.7 years ago

Hey,

Cache relates to this parameter of getBM():

useCache: Boolean indicating whether the results cache should be used. Setting to ‘FALSE’ will disable reading and writing of the cache. This argument is likely to disappear after the cache functionality has been tested more thoroughly.

It's basically data that is stored on your local drive from when you previously ran biomaRt, so "Cache found" just means that a stored result for the same query was re-used. Because that cache lives on disk, restarting R does not remove it. It is still good practice to restart your R session for every new analysis in order to clear memory and avoid re-using old variables that lurk in your workspace.
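If you want to look at the cache or rule it out as a source of stale results, something like the following should work. It is only a sketch: it assumes a biomaRt version recent enough to export biomartCacheInfo() and biomartCacheClear(), a working connection to Ensembl, and a two-gene query picked purely for illustration.

```r
library(biomaRt)

# Report where the on-disk cache lives and how much it holds
biomartCacheInfo()

# Delete every cached result; later queries go back to the server
biomartCacheClear()

mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")

# useCache = FALSE skips reading and writing the cache for this one call
mapping <- getBM(
  attributes = c("mgi_symbol", "ensembl_gene_id"),
  filters = "mgi_symbol",
  values = c("Ank2", "Scg2"),   # illustrative subset of the symbols
  mart = mart,
  useCache = FALSE)
```

With useCache = FALSE the "Cache found" message should no longer appear, at the cost of a slower query.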

The problem in this case is that you have mouse gene symbols but are querying them as HGNC symbols. HGNC symbols are specific to Homo sapiens (human), so the filter matches nothing in the mouse dataset and you get 0 rows. For mouse you will want MGI symbols (mgi_symbol):

library(biomaRt)

# Connect to the Ensembl genes mart (US East mirror) and select the mouse dataset
mart <- useMart('ENSEMBL_MART_ENSEMBL', host = 'useast.ensembl.org')
mart <- useDataset('mmusculus_gene_ensembl', mart)

data <- c('Ank2','Scg2','Nefh','Sgip1','Amph','Srcin1')

# Filter on MGI symbols, since these are mouse genes
mapping <- getBM(
  attributes = c('mgi_symbol', 'ensembl_gene_id', 'entrezgene_id'),
  filters = 'mgi_symbol',
  mart = mart,
  values = data,
  uniqueRows = TRUE,
  bmHeader = TRUE)

mapping

  MGI symbol     Gene stable ID NCBI gene (formerly Entrezgene) ID
1       Amph ENSMUSG00000021314                             218038
2       Ank2 ENSMUSG00000032826                             109676
3       Nefh ENSMUSG00000020396                             380684
4       Scg2 ENSMUSG00000050711                              20254
5      Sgip1 ENSMUSG00000028524                              73094
6     Srcin1 ENSMUSG00000038453                              56013

Kevin


Thanks Kevin for pointing out the species error. It works fine now. My input file has 7289 genes, some of them duplicated. After conversion, getBM removed the duplicate IDs and returned 4731 IDs. I do not want it to drop the duplicates, because I will be combining the output with my original dataset for further downstream analysis. Is there any way to get around that with getBM?


Did you try uniqueRows = FALSE? Generally, with biomaRt, extra work is required after you perform the initial mapping. You will note that biomaRt does not even return the genes in the same order in which they were submitted.

For 1-to-1 mapping, org.Mm.eg.db may be a better option. See step 3, here: https://support.bioconductor.org/p/130727/#130733
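If you do stay with biomaRt, the extra work can be as small as one match() call that lines the result back up with your input. The sketch below uses a hand-made data frame as a stand-in for the getBM() output, so it runs without a network connection; the symbols vector (including the made-up "NotAGene") and the lower-case column names are assumptions for illustration only.

```r
# Toy input with a duplicate and one symbol that will not map
symbols <- c("Scg2", "Ank2", "Scg2", "NotAGene")

# Stand-in for a getBM() result: unique rows, arbitrary order
mapping <- data.frame(
  mgi_symbol      = c("Ank2", "Scg2"),
  ensembl_gene_id = c("ENSMUSG00000032826", "ENSMUSG00000050711"),
  stringsAsFactors = FALSE
)

# For each input symbol, find the row of its first hit in the mapping
idx <- match(symbols, mapping$mgi_symbol)

# One output row per input element, in input order; unmapped symbols get NA
aligned <- data.frame(
  input_symbol    = symbols,
  ensembl_gene_id = mapping$ensembl_gene_id[idx],
  stringsAsFactors = FALSE
)
aligned
```

Note that match() keeps only the first hit, so a symbol that maps to several Ensembl IDs would silently lose the others.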


uniqueRows = FALSE doesn't do it either. But yes, AnnotationDbi package provides the output like I want it. Thank you.
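For anyone landing here later, a minimal sketch of the AnnotationDbi route, assuming org.Mm.eg.db is installed and that taking the first hit for multi-mapping symbols (multiVals = "first") is acceptable. The three-element symbols vector is only an illustration. As far as I can tell, mapIds() returns one value per input key, in input order, so duplicates are kept:

```r
library(AnnotationDbi)
library(org.Mm.eg.db)

# Illustrative input with a duplicated symbol
symbols <- c("Scg2", "Ank2", "Scg2")

# One Ensembl ID per input symbol; first hit wins when a symbol maps to several
ensembl_ids <- mapIds(org.Mm.eg.db, keys = symbols,
                      keytype = "SYMBOL", column = "ENSEMBL",
                      multiVals = "first")

# Same idea for Entrez gene IDs
entrez_ids <- mapIds(org.Mm.eg.db, keys = symbols,
                     keytype = "SYMBOL", column = "ENTREZID",
                     multiVals = "first")

# Assemble a table with the same length and order as the input
result <- data.frame(
  symbol  = symbols,
  ensembl = unname(ensembl_ids),
  entrez  = unname(entrez_ids),
  stringsAsFactors = FALSE
)
result
```

Because the result has the same length and order as the input, it can be bound column-wise to the original dataset.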
