Question

Get gene symbols from gene ids for mouse using BioMart

3

6.2 years ago

nikitavlassenko ▴ 110

I am trying to get gene symbols for gene IDs from mouse datasets. The gene IDs look like this: 0610009B22Rik. The code I am trying to use is the following:

ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
mouse_gene_ids <- dataset[, 1]
foo <- getBM(attributes=c('ensembl_gene_id',
                      'external_gene_name'),
         filters = 'genedb',
         values = mouse_gene_ids,
         mart = ensembl)

The query runs but returns zero results. I guess the filters parameter is wrong. Any suggestions would be greatly appreciated.

BioMart mouse gene ids gene symbols • 20k views

ADD COMMENT • link 6.2 years ago by nikitavlassenko ▴ 110

score 9 · Accepted Answer · 2018-02-28

9

6.2 years ago

Mike Smith ★ 2.0k

The filter you need is mgi_symbol, e.g.:

library(biomaRt)

ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
mouse_gene_ids  <- "0610009B22Rik"

foo <- getBM(attributes=c('ensembl_gene_id',
                          'external_gene_name'),
             filters = 'mgi_symbol',
             values = mouse_gene_ids,
             mart = ensembl)

Here's the result:

> foo
     ensembl_gene_id external_gene_name
1 ENSMUSG00000007777      0610009B22Rik

I find the best way to choose the correct filter is to start with the Ensembl BioMart web interface, use the examples in the Filters -> external references ID list dropdown to find the format I'm using, and then hit the XML button near the top. This will let you see the filter name required by biomaRt.
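
A similar lookup can be approximated from within R by searching the table of available filters. The sketch below is only an illustration under stated assumptions: find_filters() is a hypothetical helper, it expects a table shaped like the one listFilters() returns (columns name and description), and toy_filters is made-up stand-in data so the example runs without a network connection.

```r
# Hypothetical helper: search a filter table (columns 'name', 'description')
# for a pattern in either column, ignoring case.
find_filters <- function(filter_table, pattern) {
  hit <- grepl(pattern, filter_table$name, ignore.case = TRUE) |
    grepl(pattern, filter_table$description, ignore.case = TRUE)
  filter_table[hit, , drop = FALSE]
}

# Toy stand-in for listFilters(ensembl); the descriptions are illustrative only.
toy_filters <- data.frame(
  name = c("ensembl_gene_id", "entrezgene_id", "mgi_symbol"),
  description = c("Gene stable ID(s)", "NCBI gene ID(s)", "MGI symbol(s)"),
  stringsAsFactors = FALSE
)

mgi_hits <- find_filters(toy_filters, "mgi")
print(mgi_hits$name)

# With a live connection, the real table would come from biomaRt instead:
# ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# find_filters(listFilters(ensembl), "mgi")
```

Searching both columns helps because the filter name and its human-readable description do not always share the same keywords.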

ADD COMMENT • link 6.2 years ago by Mike Smith ★ 2.0k

0

library(biomaRt)
listMarts()
ensembl=useMart("ensembl")
datasets <- listDatasets(ensembl)
head(datasets)
ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)
entrzID=c("14455", "80904", "94275")
filters = listFilters(ensembl)
filters[1:50,]
getBM(attributes = c("ensembl_gene_id", "external_gene_name"), filters = "mgi_symbol", values = entrzID, mart = ensembl)

output:

[1] ensembl_gene_id    external_gene_name
<0 rows> (or 0-length row.names)

Why can't I get any gene symbols?

ADD REPLY • link 4.7 years ago by Kai_Qi ▴ 130

1

Perhaps you want to try the entrezgene_id filter instead?

getBM(attributes = c("ensembl_gene_id", "external_gene_name"), 
       filters = "entrezgene_id", 
       values = entrzID, 
       mart = ensembl)

     ensembl_gene_id external_gene_name
1 ENSMUSG00000040415               Dtx3
2 ENSMUSG00000025151             Maged1
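
The recurring issue in this thread is that the filter has to match the type of ID being supplied. As a rough sketch, the choice could be automated from the shape of the IDs. guess_filter() is a hypothetical helper, not part of biomaRt, and its patterns are assumptions that cover only the three ID types discussed here.

```r
# Hypothetical helper: guess a biomaRt filter name from the shape of the IDs.
guess_filter <- function(ids) {
  ids <- as.character(ids)
  if (all(grepl("^ENSMUSG[0-9]+(\\.[0-9]+)?$", ids))) {
    "ensembl_gene_id"   # Ensembl mouse gene IDs, with or without a version suffix
  } else if (all(grepl("^[0-9]+$", ids))) {
    "entrezgene_id"     # purely numeric IDs, as in the Entrez example
  } else {
    "mgi_symbol"        # fallback assumption: anything else is treated as a symbol
  }
}

print(guess_filter(c("14455", "80904", "94275")))
print(guess_filter("0610009B22Rik"))
print(guess_filter("ENSMUSG00000051951.5"))
```

The helper only picks the filter name. Versioned Ensembl IDs would still need their suffix removed before being passed as values.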

ADD REPLY • link 4.7 years ago by Mike Smith ★ 2.0k

0

Yes. I tried it and it works.

Thank you so much for the help!

ADD REPLY • link 4.7 years ago by Kai_Qi ▴ 130

0

Hello, I tried to follow the previous posts and everything ran, but I did not get anything back as a result. My code is below:

library(biomaRt)
ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
genes_ids <- c('ENSMUSG00000051951.5', 'ENSMUSG00000025900.12', 'ENSMUSG00000025902.13')
gs_heatdata <- getBM(attributes = c("external_gene_name"), filters = "mgi_symbol", values = genes_ids, mart = ensembl)

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

1

Hi, you need to remove the trailing numbers from the gene IDs. Also, the value for filters should be ensembl_gene_id. Please try this:

library(biomaRt)
ensembl <- useMart('ensembl', dataset = 'mmusculus_gene_ensembl')
genes_ids <- sub('\\.[0-9]*$', '',
  c('ENSMUSG00000051951.5', 'ENSMUSG00000025900.12', 'ENSMUSG00000025902.13'))
gs_heatdata <- getBM(
  attributes = c('external_gene_name', 'mgi_symbol','ensembl_gene_id'),
  filters = 'ensembl_gene_id',
  values = genes_ids,
  mart = ensembl)

gs_heatdata
  external_gene_name mgi_symbol    ensembl_gene_id
1                Rp1        Rp1 ENSMUSG00000025900
2              Sox17      Sox17 ENSMUSG00000025902
3               Xkr4       Xkr4 ENSMUSG00000051951

ADD REPLY • link 3.1 years ago by Kevin Blighe 87k

0

It works perfectly, but I did not understand how you managed it:
- Does the trailing number stand for the 0s before the actual ID?
- Could you explain in particular what sub('\\.[0-9]*$', '', refers to?

Thank you a lot!

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

2

That is a regular expression that matches a period followed by any run of digits (0 to 9) at the end of the string and substitutes it with nothing (i.e. deletes it).
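
To make the pattern concrete, here is a small sketch on a few example strings. In the pattern, \\. is a literal period, [0-9]* is zero or more digits, and $ anchors the match to the end of the string, so only a trailing version suffix is removed. The input vector is illustrative; the last element is included to show that strings without a suffix pass through unchanged.

```r
# Two versioned Ensembl IDs from the thread plus one plain symbol.
versioned <- c("ENSMUSG00000051951.5", "ENSMUSG00000025900.12", "Sox17")

# Remove a trailing ".<digits>" if present; leave everything else alone.
stripped <- sub("\\.[0-9]*$", "", versioned)
print(stripped)
# "ENSMUSG00000051951" "ENSMUSG00000025900" "Sox17"
```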

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

Sorry, I forgot one more question. How can I make the code "cleaner"? The output ends up showing two columns that are the same, 'external_gene_name' and 'mgi_symbol'.

Thank you!

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

2

Change the following line

attributes = c('external_gene_name', 'mgi_symbol','ensembl_gene_id')

to

attributes = c('external_gene_name', 'ensembl_gene_id')

Or keep mgi_symbol if you want to keep that instead.

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

I tried with my whole dataset but it did not work. I just get back an empty table with external_gene_name and ensembl_gene_id as headers.

library(biomaRt)
ensembl <- useMart("ensembl",dataset="mmusculus_gene_ensembl")
genes_ids <- sub('\\.[0-9]*$', '',  row.names(heatdata))
gs_heatdata <- getBM(attributes = c('external_gene_name', 'ensembl_gene_id'), 
                 filters = "mgi_symbol",
                 values = genes_ids,
                 mart = ensembl)

head(heatdata)
                      T0medium T0medium T0medium    T0LAL    T0LAL    T0LAL    6hLAL    6hLAL    6hLAL   6hIMQ
ENSMUSG00000051951.5  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
ENSMUSG00000025900.12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
ENSMUSG00000025902.13 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
ENSMUSG00000033845.13 8.635869 8.717134 8.644194 8.688051 8.729801 8.719839 8.522753 8.451425 8.588430 8.93282
ENSMUSG00000025903.14 9.244627 9.269090 9.357344 9.148911 9.297785 9.352155 9.265217 9.099127 9.255727 9.28542
ENSMUSG00000104217.1  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
                         6hIMQ    6hIMQ   16hLAL   16hLAL   16hLAL   16hIMQ   16hIMQ   16hIMQ
ENSMUSG00000051951.5  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000025900.12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000025902.13 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSMUSG00000033845.13 8.838776 8.843039 8.541431 8.565437 8.534634 9.114412 9.122216 9.117485
ENSMUSG00000025903.14 9.392362 9.217806 9.207043 9.377954 9.266217 9.221498 9.238627 9.220453
ENSMUSG00000104217.1  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

0

Hi, the converted IDs are contained in gs_heatdata. You then have to align these to the rownames of heatdata, and then replace them with the external gene IDs (MGI symbols).

ADD REPLY • link 3.1 years ago by Kevin Blighe 87k

0

Hi, how can I align them? Which function should I use? How can I then replace them with the external gene IDs? Should I first convert the row.names of heatdata into the first column and then somehow combine the data frame gs_heatdata with the data frame heatdata? Thank you a lot! :)

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

0

Hi, please take a look at functions such as which() and match(), and at functions from the dplyr package for matching data frames.

A quick example:

array1 <- c('a','b','c','d','e','f','g')
array2 <- c('e','f','g','a','b','c','d')
idx <- match(array1, array2)
data.frame(array1 = array1, array2 = array2[idx])
  array1 array2
1      a      a
2      b      b
3      c      c
4      d      d
5      e      e
6      f      f
7      g      g
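
Applied to the case in this thread, the same idea might look like the sketch below. It uses toy stand-ins rather than the real data: heatdata_toy and annot_toy are hypothetical names, the expression values are invented, and the ID-to-symbol pairs are the ones shown earlier in the thread. The annotation table is deliberately in a different order from the matrix rows.

```r
# Toy expression matrix with versioned Ensembl IDs as row names (values invented).
heatdata_toy <- matrix(
  c(0, 0, 8.6, 0, 0, 8.7), nrow = 3,
  dimnames = list(
    c("ENSMUSG00000051951.5", "ENSMUSG00000025900.12", "ENSMUSG00000033845.13"),
    c("sample1", "sample2")
  )
)

# Toy stand-in for a getBM() result, in a different row order.
annot_toy <- data.frame(
  external_gene_name = c("Rp1", "Mrpl15", "Xkr4"),
  ensembl_gene_id = c("ENSMUSG00000025900", "ENSMUSG00000033845",
                      "ENSMUSG00000051951"),
  stringsAsFactors = FALSE
)

# Strip version suffixes, then find each row's position in the annotation.
row_ids <- sub("\\.[0-9]*$", "", rownames(heatdata_toy))
idx <- match(row_ids, annot_toy$ensembl_gene_id)

# Replace the row names with symbols, in the matrix's own row order.
rownames(heatdata_toy) <- annot_toy$external_gene_name[idx]
print(rownames(heatdata_toy))
```

Note that match() is given two character vectors here, not the data frames themselves.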

ADD REPLY • link 3.1 years ago by Kevin Blighe 87k

0

Hi, I tried for now with match() but I think it did not work.

matched_heatdata <- match(gs_heatdata, heatdata)
matched_heatdata
[1] NA NA

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

0

match() returns the indices [in heatdata] of the elements of gs_heatdata

What you likely need is:

idx <- match(
  sub('\\.[0-9]*$', '', rownames(heatdata)),
  gs_heatdata$ensembl_gene_id)
gs_heatdata <- gs_heatdata[idx,]
all(sub('\\.[0-9]*$', '', rownames(heatdata)) == gs_heatdata$ensembl_gene_id) # must return TRUE

ADD REPLY • link 3.1 years ago by Kevin Blighe 87k

0

OK, I'll try this. Just for me to understand: can I also use the previous genes_ids, or do I have to put the entire sub('\\.[0-9]*$', '', rownames(heatdata)) in match() and then in all()? Thank you!

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

0

It returned this:

idx <- match(sub('\\.[0-9]*$', '', rownames(heatdata)),gs_heatdata$ensembl_gene_id)
gs_heatdata <- gs_heatdata[idx,]
all(sub('\\.[0-9]*$', '', rownames(heatdata)) == gs_heatdata$ensembl_gene_id)
[1] NA

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20

1

I think I found a problem, and it was right in front of me: the filter was set wrong. I had to use filters = "ensembl_gene_id" instead of filters = "mgi_symbol". Now gs_heatdata looks good:

     external_gene_name    ensembl_gene_id
3079               Xkr4 ENSMUSG00000051951
424                 Rp1 ENSMUSG00000025900
425               Sox17 ENSMUSG00000025902
1951             Mrpl15 ENSMUSG00000033845
426              Lypla1 ENSMUSG00000025903
4321            Gm37988 ENSMUSG00000104217

But if I proceed with the previous code, I still get NA:

idx <- match(sub('\\.[0-9]*$', '', rownames(heatdata)), gs_heatdata$ensembl_gene_id)
gs_heatdata <- gs_heatdata[idx,]
all(sub('\\.[0-9]*$', '', rownames(heatdata)) == gs_heatdata$ensembl_gene_id)
[1] NA
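
One plausible explanation, offered as an assumption because the thread does not show the full data, is that some row IDs have no entry in gs_heatdata. match() returns NA for those, the aligned table then contains NA rows, and all() returns NA when it sees an NA and no FALSE. The sketch below reproduces this with toy data; annot_toy and the unmatched ID are made up for illustration.

```r
# Toy row IDs where the last one has no annotation (made-up ID).
row_ids <- c("ENSMUSG00000051951", "ENSMUSG00000025900", "ENSMUSG99999999999")
annot_toy <- data.frame(
  external_gene_name = c("Rp1", "Xkr4"),
  ensembl_gene_id = c("ENSMUSG00000025900", "ENSMUSG00000051951"),
  stringsAsFactors = FALSE
)

idx <- match(row_ids, annot_toy$ensembl_gene_id)
aligned <- annot_toy[idx, ]

# The unmatched ID yields an NA row, so the equality check is NA, not TRUE.
check <- all(row_ids == aligned$ensembl_gene_id)
print(check)

# Count the unmatched IDs and fall back to the Ensembl ID where no symbol exists.
n_missing <- sum(is.na(idx))
labels <- ifelse(is.na(idx), row_ids, aligned$external_gene_name)
print(n_missing)
print(labels)
```

If sum(is.na(idx)) is greater than zero on the real data, those IDs were not returned by the query, and a fallback like the one above keeps every row labelled.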

ADD REPLY • link 3.1 years ago by tommaso.gastaldi ▴ 20