Question: Mapping Ensembl Gene IDs with dot suffix
Asked 17 months ago by mk90:

I have a batch of bulk mRNA sequencing data pulled from TCGA. The feature names appear to be Ensembl gene IDs with a suffix. Here is an example:

[995] "ENSG00000236246.1" "ENSG00000281088.1"
[997] "ENSG00000254526.1" "ENSG00000223575.2"
[999] "ENSG00000201444.1" "ENSG00000232573.1"

I want to take the intersection between these features and a set of Entrez Gene IDs. To do this I am using the biomaRt package to generate a mapping between Ensembl gene IDs and Entrez gene IDs. However, the Ensembl gene IDs in the only mapping I can find lack the suffixes. Here is the head of the table that maps Entrez genes to Ensembl genes:

  entrezgene ensembl_gene_id
1      90529 ENSG00000001460
2       9235 ENSG00000008517
3      10747 ENSG00000009724
4     654364 ENSG00000011052
5     112611 ENSG00000013392
6      57210 ENSG00000022567

Can someone explain what the Ensembl suffixes mean and how to convert these names to Entrez? If this can be done with biomaRt, it would be ideal. Thanks.


Answer by Emily_Ensembl (EMBL-EBI), 17 months ago:

The numbers after the dot are version numbers. The Ensembl documentation has information about stable ID versioning. You can just strip off the version numbers to use with biomaRt.
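
A minimal base-R sketch of that stripping step, assuming the IDs are held in a character vector (the name `versioned_ids` is only illustrative). The dot is escaped so that an ID without a version suffix passes through unchanged.

```r
# Example IDs: two from the question, one unversioned ID from the mapping table.
# `versioned_ids` is an illustrative name, not something from the thread.
versioned_ids <- c("ENSG00000236246.1", "ENSG00000223575.2", "ENSG00000001460")

# Remove a trailing ".<digits>" version suffix; "\\." matches a literal dot.
stable_ids <- sub("\\.[0-9]+$", "", versioned_ids)

# Keep the version separately in case it is needed later (NA when absent).
versions <- ifelse(grepl("\\.[0-9]+$", versioned_ids),
                   sub("^.*\\.", "", versioned_ids),
                   NA_character_)

print(data.frame(versioned_ids, stable_ids, versions))
```

Keeping the original vector alongside the stripped one makes it easy to map results back to the TCGA feature names afterwards.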

Answer by Mike Smith (EMBL Heidelberg / de.NBI), 17 months ago:

Here's an example of doing the conversion using biomaRt. You can use the versioned IDs you've got, but you'll see it's better to remove the version numbers.

First, we'll load biomaRt and use your example IDs.

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

gene_ids_version <- c("ENSG00000236246.1",
                      "ENSG00000281088.1",
                      "ENSG00000254526.1",
                      "ENSG00000223575.2",
                      "ENSG00000201444.1",
                      "ENSG00000232573.1")

Now we can query BioMart, specifying that we want to use the versioned Ensembl Gene IDs by using the following:

getBM(attributes = c('ensembl_gene_id_version',
                     'entrezgene'),
      filters = 'ensembl_gene_id_version', 
      values = gene_ids_version,
      mart = mart)

> 
  ensembl_gene_id_version entrezgene
1       ENSG00000201444.1         NA
2       ENSG00000223575.2         NA
3       ENSG00000232573.1         NA
4       ENSG00000254526.1         NA

However, notice that only 4 results are returned for our 6 IDs. If you query with a versioned ID whose version is not the current one, BioMart returns nothing for it, which is not ideal. It is better to do as Emily suggests and strip the version number to use just the Ensembl gene ID. We'll use the stringr package to do that here:
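
To see which IDs a versioned query dropped, you can compare the input vector with the returned column. This is only a sketch: `res_version` is a hand-built stand-in for the getBM() result printed above (so it runs without a network connection), and `dropped` is an illustrative name.

```r
# IDs submitted to the versioned query (the example IDs from this answer).
gene_ids_version <- c("ENSG00000236246.1", "ENSG00000281088.1",
                      "ENSG00000254526.1", "ENSG00000223575.2",
                      "ENSG00000201444.1", "ENSG00000232573.1")

# Stand-in for the getBM() result, copied from the output shown above.
res_version <- data.frame(
  ensembl_gene_id_version = c("ENSG00000201444.1", "ENSG00000223575.2",
                              "ENSG00000232573.1", "ENSG00000254526.1"),
  entrezgene = NA_integer_,
  stringsAsFactors = FALSE
)

# IDs that were queried but are absent from the result.
dropped <- setdiff(gene_ids_version, res_version$ensembl_gene_id_version)
print(dropped)
```

With a live result, replace `res_version` with the data frame returned by getBM().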

library(stringr)
# Escape the dot so only a literal ".<digits>" suffix is removed
gene_ids <- str_replace(gene_ids_version,
                        pattern = "\\.[0-9]+$",
                        replacement = "")

Now rerun the query with the trimmed IDs and you'll get 5 results this time:

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = gene_ids,
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000201444         NA
2 ENSG00000223575         NA
3 ENSG00000232573         NA
4 ENSG00000236246         NA
5 ENSG00000254526         NA

The one entry that is still missing, ENSG00000281088, has been retired from Ensembl, so it will never return a result. The NA values for the rest mean there is no mapping between Ensembl and Entrez for those genes.
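
If you need one row per original feature, including the retired ID, a left join back onto the input keeps everything in view. A sketch under stated assumptions: `res` is a hand-built copy of the result printed above rather than a live query, and `id_map`, `annotated` and `not_in_ensembl` are illustrative names.

```r
gene_ids_version <- c("ENSG00000236246.1", "ENSG00000281088.1",
                      "ENSG00000254526.1", "ENSG00000223575.2",
                      "ENSG00000201444.1", "ENSG00000232573.1")

# Lookup table linking each original feature name to its unversioned ID.
id_map <- data.frame(
  feature = gene_ids_version,
  ensembl_gene_id = sub("\\.[0-9]+$", "", gene_ids_version),
  stringsAsFactors = FALSE
)

# Stand-in for the getBM() result, copied from the output shown above.
res <- data.frame(
  ensembl_gene_id = c("ENSG00000201444", "ENSG00000223575", "ENSG00000232573",
                      "ENSG00000236246", "ENSG00000254526"),
  entrezgene = NA_integer_,
  stringsAsFactors = FALSE
)

# Left join: every input feature is kept, unmatched ones get NA.
annotated <- merge(id_map, res, by = "ensembl_gene_id", all.x = TRUE)

# IDs with no row at all in the result (for example, retired genes).
not_in_ensembl <- setdiff(id_map$ensembl_gene_id, res$ensembl_gene_id)

print(annotated)
print(not_in_ensembl)
```

Note that a real result can contain more than one row per Ensembl ID when a gene maps to several Entrez IDs, in which case the merged table will have more rows than the input.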

Just to check it's really working we'll demonstrate with some IDs that can be mapped.

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = c('ENSG00000001460', 'ENSG00000008517', 'ENSG00000009724'),
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000001460      90529
2 ENSG00000008517       9235
3 ENSG00000009724      10747
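
One caveat if you run this against a more recent Ensembl release: the Entrez attribute has since been renamed to `entrezgene_id`, so the queries above may be rejected because `entrezgene` is no longer recognised. The hypothetical helper below picks whichever name the mart offers. In a live session `available` would come from `listAttributes(mart)$name`; here it is a hard-coded stand-in so the sketch runs offline.

```r
# Hypothetical helper: return the first Entrez attribute name the mart supports.
pick_entrez_attribute <- function(available) {
  candidates <- c("entrezgene_id", "entrezgene")  # newer name first
  found <- candidates[candidates %in% available]
  if (length(found) == 0) stop("No Entrez gene attribute found in this mart")
  found[1]
}

# Offline stand-in; with a live connection use:
#   available <- listAttributes(mart)$name
available <- c("ensembl_gene_id", "ensembl_gene_id_version", "entrezgene_id")
entrez_attr <- pick_entrez_attribute(available)
print(entrez_attr)

# The query would then become (needs network access, so left commented out):
# getBM(attributes = c("ensembl_gene_id", entrez_attr),
#       filters = "ensembl_gene_id", values = gene_ids, mart = mart)
```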

Answer by Bastien Hervé (Limoges, CBRS, France), 17 months ago:

Something like this? In the R console:

data <- c("ENSG00000236246.1","ENSG00000281088.1","ENSG00000254526.1","ENSG00000223575.2","ENSG00000201444.1","ENSG00000232573.1")
data_modified <- sapply(strsplit(data,"\\."), function(x) x[1])

Answer by PavolG (Bethesda/NIH), 15 months ago:

My favorite way to strip the versions uses nth() from dplyr and tstrsplit() from data.table:

dplyr::nth(data.table::tstrsplit(gene_ids_version, split = "\\."), n = 1)