Question

Conversion of Gene Name to Ensembl ID

3

5.6 years ago

bazok ▴ 40

Hi, I have searched a lot, but none of the solutions I have seen has been fully helpful. I want to convert a list of >20k gene names to Ensembl IDs. Any script/tool/guide would really be helpful.

Thanks

gene RNA-Seq R genome • 39k views

ADD COMMENT • link updated 6 weeks ago by Thanujay S • 0 • written 5.6 years ago by bazok ▴ 40

0

Hello. Please paste a sample of the gene names that you have, and state the species, which will also help.

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

0

Hi, a few of the gene names/symbols are below: A1BG, A1BG-AS1, A1CF, A2M, A2M-AS1, A2ML1, A2MP1, A3GALT2, A4GALT, A4GNT

Thanks

ADD REPLY • link 5.6 years ago by bazok ▴ 40

0

Thanks. These seem to be HGNC symbols. Both solutions below should help you. Please take time to check.

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

0

Thanks all for the input. I will run through the suggestions and give feedback.

Regards

ADD REPLY • link 5.6 years ago by bazok ▴ 40

2

5.6 years ago

MatthewP ★ 1.4k

Hello, here are some ways I know:
1. R package org.Hs.eg.db: this package contains mappings between gene IDs such as SYMBOL, Entrez ID, and Ensembl ID.
2. R package biomaRt: this package helps you query information (including gene ID mappings) from BioMart.
3. You can download gene ID data from BioMart. Select Ensembl Genes 99 --> Human genes --> Attributes --> GENE --> External References --> select HGNC symbol and NCBI gene ID --> Results. If you don't know how to use R, you can use this file with another language.

ADD COMMENT • link 5.6 years ago by MatthewP ★ 1.4k

1

4.8 years ago

Kevin Blighe 89k

[Yet] Another method here, by Pierre: A: Converting Ensembl Gene Ids To Hgnc Gene Name / Coordinates

ADD COMMENT • link 4.8 years ago by Kevin Blighe 89k

0

6 weeks ago

Thanujay S • 0

Hey! A bit late to the party! I’ve built a simple wrapper (named SESC) around biomaRt to streamline Ensembl ID conversions. It supports both single queries for quick lookups and batch conversions for larger datasets.

Single Mode

Rscript SESC_v0.1.R -m single -q ENSG00000012048 -a ensembl_gene_id,hgnc_symbol -f ensembl_gene_id -o stdout

Batch Mode

Rscript SESC_v0.1.R -m batch -i test_batch.txt -a ensembl_gene_id,hgnc_symbol -f ensembl_gene_id -o test_batch_output.txt

GitHub Repo: https://github.com/Thanujay/SESC

Thank you!

ADD COMMENT • link 6 weeks ago by Thanujay S • 0

score 7 · Accepted Answer · 2020-03-31

7

5.6 years ago

Arup Ghosh 3.5k

As the organism is not mentioned, I'm sharing an R snippet with human as a placeholder.

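library("AnnotationDbi")
library("org.Hs.eg.db")
# df is assumed to be a data frame with the gene symbols in df$symbol
df$ensid = mapIds(org.Hs.eg.db,
                    keys=df$symbol,
                    column="ENSEMBL",
                    keytype="SYMBOL",
                    multiVals="first")  # keep one Ensembl ID per symbol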
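
A note on multiVals="first": when a symbol maps to more than one Ensembl ID, only one of them is kept and the others are dropped without a message. The base-R sketch below mimics that behaviour on a toy table so the effect is visible. The table and the duplicate ID are invented for illustration; only the three real IDs come from this thread.

```r
# Toy symbol-to-Ensembl table. "ENSG_HYPOTHETICAL_DUP" is a made-up second
# hit for A1CF, added only to show what happens with multiple matches.
map <- data.frame(
  symbol = c("A1BG", "A1CF", "A1CF", "A1BG-AS1"),
  ensid  = c("ENSG00000121410", "ENSG00000148584",
             "ENSG_HYPOTHETICAL_DUP", "ENSG00000268895"),
  stringsAsFactors = FALSE
)

# Keep the first row for each symbol, which is what "first" amounts to.
first_hit <- map[!duplicated(map$symbol), ]

# Symbols that had more than one candidate and may deserve a manual check.
multi <- unique(map$symbol[duplicated(map$symbol)])

print(first_hit)
print(multi)
```

If the dropped IDs matter for your analysis, it may be worth listing the multi-mapped symbols like this before settling on one ID per gene.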

ADD COMMENT • link 5.6 years ago by Arup Ghosh 3.5k

1

Thanks Arup! It's good to have both the biomaRt and AnnotationDbi solutions. Judging by the question title, this particular question will be picked up extensively by search engines.

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

0

Thanks a lot. With the above and hints from the link below, I was able to convert around 20k gene symbols to Ensembl IDs. There are 3.3k that returned "NA". I tried biomaRt to recover the remaining 3.3k, but I keep getting an error (Error in bmRequest(request = request, verbose = verbose) : Internal Server Error (HTTP 500)), which I am still not able to resolve. Any help will be appreciated.

Can't fetch pathways by entrez id?

Regards

ADD REPLY • link 5.6 years ago by bazok ▴ 40
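
For anyone in the same position, it can help to pull the symbols that returned NA into their own vector before trying a second method on them. A minimal sketch, assuming a data frame with columns named symbol and ensid as in the mapIds snippet above; the rows here are toy data and "NOT_A_GENE" is a placeholder:

```r
# Toy result of a first mapping pass; NA marks a symbol that did not map.
df <- data.frame(
  symbol = c("A1BG", "A1CF", "NOT_A_GENE"),
  ensid  = c("ENSG00000121410", "ENSG00000148584", NA),
  stringsAsFactors = FALSE
)

# Symbols to carry into a second pass (biomaRt, REST API, manual check).
unmapped <- df$symbol[is.na(df$ensid)]

# Share of symbols that mapped on the first pass.
mapped_fraction <- mean(!is.na(df$ensid))

print(unmapped)
print(mapped_fraction)
```

Unmapped symbols are often outdated names or aliases, so inspecting a handful by hand before rerunning anything may save time.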

score 5 · Accepted Answer · 2020-03-31

5

5.6 years ago

Kevin Blighe 89k

Assuming that you have HGNC symbols, you can achieve this via biomaRt in R:

require('biomaRt')

mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)

annotLookup <- getBM(
  mart = mart,
  attributes = c(
    'hgnc_symbol',
    'ensembl_gene_id',
    'gene_biotype'),
  uniqueRows = TRUE)

head(annotLookup)
  hgnc_symbol ensembl_gene_id   gene_biotype
1       MT-TF ENSG00000210049        Mt_tRNA
2     MT-RNR1 ENSG00000211459        Mt_rRNA
3       MT-TV ENSG00000210077        Mt_tRNA
4     MT-RNR2 ENSG00000210082        Mt_rRNA
5      MT-TL1 ENSG00000209082        Mt_tRNA
6      MT-ND1 ENSG00000198888 protein_coding

tail(annotLookup)
      hgnc_symbol ensembl_gene_id         gene_biotype
67142             ENSG00000285949               lncRNA
67143             ENSG00000284921               lncRNA
67144             ENSG00000285440 processed_pseudogene
67145             ENSG00000285110 processed_pseudogene
67146    MTRF1LP2 ENSG00000285363 processed_pseudogene
67147       GSDMC ENSG00000285114       protein_coding

tail(subset(annotLookup, hgnc_symbol != ''))
      hgnc_symbol ensembl_gene_id         gene_biotype
67137  RNU6-1233P ENSG00000285461                snRNA
67139      RUVBL1 ENSG00000284901       protein_coding
67140   RNU6-823P ENSG00000284805                snRNA
67141      EEFSEC ENSG00000284869       protein_coding
67146    MTRF1LP2 ENSG00000285363 processed_pseudogene
67147       GSDMC ENSG00000285114       protein_coding

Then, use annotLookup as a lookup table for your genes.
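
One way to do that lookup is with match(). This is only a sketch: the small annotLookup below stands in for the full table returned by getBM(), using rows that appear in this thread, and "NOT_A_GENE" is a placeholder for a symbol with no match.

```r
# Stand-in for the table returned by getBM(); the last row has an empty
# symbol, as many Ensembl genes do.
annotLookup <- data.frame(
  hgnc_symbol     = c("A1BG", "A1BG-AS1", "A1CF", ""),
  ensembl_gene_id = c("ENSG00000121410", "ENSG00000268895",
                      "ENSG00000148584", "ENSG00000285949"),
  stringsAsFactors = FALSE
)

genes <- c("A1CF", "A1BG", "NOT_A_GENE")

# Drop rows without a symbol so an empty string can never match.
lookup <- subset(annotLookup, hgnc_symbol != "")

# match() returns the row index of each gene, or NA when it is absent.
ens <- lookup$ensembl_gene_id[match(genes, lookup$hgnc_symbol)]
names(ens) <- genes

print(ens)
```

match() keeps the order of your input and returns only the first hit per symbol, so symbols that appear more than once in the table need separate handling.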

Kevin

ADD COMMENT • link 5.6 years ago by Kevin Blighe 89k

1

This is perfect Kevin, thanks!

ADD REPLY • link 2.0 years ago by Nat.Nataren ▴ 100

0

Hi Kevin, could you please elaborate a little on how to use annotLookup? Suppose I have an input file, "genelist.csv".

Thanks

ADD REPLY • link 5.6 years ago by bazok ▴ 40

0

Hi, you just need to read the data into R via read.csv() or read.table(). There, you can match your genes to the output contained in annotLookup.
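
As a rough sketch of those two steps with a file on disk: the example writes a tiny stand-in for genelist.csv to a temporary file so that it runs on its own. The column name "symbol" is an assumption about the input file, and the annotLookup here is a cut-down stand-in for the real getBM() output.

```r
# Create a small stand-in for genelist.csv (one column, header "symbol").
csv_path <- tempfile(fileext = ".csv")
writeLines(c("symbol", "A1BG", "A1BG-AS1", "NOT_A_GENE"), csv_path)

genelist <- read.csv(csv_path, stringsAsFactors = FALSE)

# Cut-down stand-in for the annotLookup table built with getBM().
annotLookup <- data.frame(
  hgnc_symbol     = c("A1BG", "A1BG-AS1", "A1CF"),
  ensembl_gene_id = c("ENSG00000121410", "ENSG00000268895", "ENSG00000148584"),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps every input gene, with NA where no match exists.
annotated <- merge(genelist, annotLookup,
                   by.x = "symbol", by.y = "hgnc_symbol", all.x = TRUE)

print(annotated)
```

Note that merge() sorts the result by the key column by default and returns one row per match, so a symbol with several Ensembl IDs will appear several times.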

However, it seems that your problem has now been resolved.

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

0

Hi Kevin, thanks for the reply. Yes, I was able to get past that, but I still run into an error (Error in bmRequest(request = request, verbose = verbose) : Internal Server Error (HTTP 500)). I am still not able to fix this to convert the remaining 3.3k gene codes.

ADD REPLY • link 5.6 years ago by bazok ▴ 40

1

Hi, sometimes there is a problem with the Ensembl 'mirror' that is automatically chosen by biomaRt. Please try the same code but with this line:

mart <- useMart('ENSEMBL_MART_ENSEMBL', host = 'uswest.ensembl.org')
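
If one host keeps failing, the same idea can be wrapped in a small loop that tries several hosts in turn. The sketch below is generic base R: connect_with_fallback is a hypothetical helper, not part of biomaRt, and it takes the connection function as an argument so it can be tried without network access. The biomaRt call in the comment is the intended use; "www.ensembl.org" is the main site, and recent biomaRt versions may expect the full https:// URL for the host.

```r
# Try each host in turn and return the first connection that succeeds.
# `connect` is any function that takes a host name and returns a connection
# object or throws an error.
connect_with_fallback <- function(hosts, connect) {
  for (host in hosts) {
    result <- tryCatch(connect(host), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    message("Could not connect via ", host)
  }
  stop("No host responded: ", paste(hosts, collapse = ", "))
}

# Intended use (needs biomaRt and network access, so it is commented out):
# mart <- connect_with_fallback(
#   c("uswest.ensembl.org", "www.ensembl.org"),
#   function(h) biomaRt::useMart("ENSEMBL_MART_ENSEMBL", host = h)
# )

# Offline demonstration with a fake connector that fails for one host.
fake_connect <- function(h) {
  if (h == "bad.example.org") stop("Internal Server Error (HTTP 500)")
  paste0("connected:", h)
}
demo <- connect_with_fallback(c("bad.example.org", "good.example.org"), fake_connect)
print(demo)
```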

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

score 2 · Accepted Answer · 2020-03-31

Using the Ensembl REST API:

http://rest.ensembl.org/lookup/symbol/homo_sapiens/A1CF

assembly_name: GRCh38
biotype: protein_coding
db_type: core
description: APOBEC1 complementation factor [Source:HGNC Symbol;Acc:HGNC:24086]
display_name: A1CF
end: 50885675
id: ENSG00000148584
logic_name: ensembl_havana_gene_homo_sapiens
object_type: Gene
seq_region_name: 10
source: ensembl_havana
species: homo_sapiens
start: 50799409
strand: -1
version: 15


http://rest.ensembl.org/lookup/symbol/homo_sapiens/A1CF?content-type=application/json

{"strand":-1,"assembly_name":"GRCh38","version":15,"species":"homo_sapiens","end":50885675,"description":"APOBEC1 complementation factor [Source:HGNC Symbol;Acc:HGNC:24086]","source":"ensembl_havana","db_type":"core","object_type":"Gene","id":"ENSG00000148584","seq_region_name":"10","display_name":"A1CF","start":50799409,"logic_name":"ensembl_havana_gene_homo_sapiens","biotype":"protein_coding"}

Look up multiple symbols at one time:

$ wget -q --header='Content-type:application/json' --header='Accept:application/json' --post-data='{ "symbols" : ["A1BG","A1BG-AS1","A1CF" ] }' 'http://rest.ensembl.org/lookup/symbol/homo_sapiens'  -O -

{"A1CF":{"object_type":"Gene","version":15,"db_type":"core","seq_region_name":"10","end":50885675,"display_name":"A1CF","id":"ENSG00000148584","assembly_name":"GRCh38","source":"ensembl_havana","biotype":"protein_coding","start":50799409,"strand":-1,"logic_name":"ensembl_havana_gene_homo_sapiens","species":"homo_sapiens","description":"APOBEC1 complementation factor [Source:HGNC Symbol;Acc:HGNC:24086]"},"A1BG-AS1":{"start":58347718,"strand":1,"logic_name":"havana_homo_sapiens","species":"homo_sapiens","description":"A1BG antisense RNA 1 [Source:HGNC Symbol;Acc:HGNC:37133]","source":"havana","biotype":"lncRNA","id":"ENSG00000268895","assembly_name":"GRCh38","object_type":"Gene","version":6,"seq_region_name":"19","db_type":"core","end":58355455,"display_name":"A1BG-AS1"},"A1BG":{"description":"alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]","logic_name":"ensembl_havana_gene_homo_sapiens","species":"homo_sapiens","strand":-1,"start":58345178,"biotype":"protein_coding","source":"ensembl_havana","assembly_name":"GRCh38","id":"ENSG00000121410","display_name":"A1BG","seq_region_name":"19","version":12,"end":58353492,"db_type":"core","object_type":"Gene"}}
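
For a list of >20k symbols, the POST request would need to be sent in batches and the JSON responses turned into a symbol-to-ID table. The R sketch below covers only those two offline steps, not the HTTP call itself. It assumes the jsonlite package is installed; make_post_bodies and extract_ids are hypothetical helper names, and the default batch size of 1000 is an assumption about the endpoint's POST limit that should be checked against the REST documentation.

```r
library(jsonlite)

# Split the symbols into batches and build one JSON POST body per batch,
# in the same {"symbols": [...]} shape used in the wget example.
make_post_bodies <- function(symbols, batch_size = 1000) {
  batches <- split(symbols, ceiling(seq_along(symbols) / batch_size))
  vapply(batches,
         function(b) as.character(toJSON(list(symbols = b))),
         character(1))
}

# Turn one JSON response into a named vector: symbol -> Ensembl gene ID.
extract_ids <- function(json_text) {
  res <- fromJSON(json_text)
  vapply(res, function(gene) gene$id, character(1))
}

# Trimmed copy of the response shown above (only two fields kept per gene).
resp <- paste0(
  '{"A1CF":{"id":"ENSG00000148584","display_name":"A1CF"},',
  '"A1BG-AS1":{"id":"ENSG00000268895","display_name":"A1BG-AS1"},',
  '"A1BG":{"id":"ENSG00000121410","display_name":"A1BG"}}'
)

ids <- extract_ids(resp)
bodies <- make_post_bodies(c("A1BG", "A1BG-AS1", "A1CF"), batch_size = 2)

print(ids)
print(bodies)
```

Each element of bodies can then be passed as the --post-data value of the wget command, and the per-batch results combined with c().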