Question: Mapping Ensembl IDs to Entrez - Merge data frames
0
13 months ago by
rin30
rin30 wrote:

Hi everyone

I am working on a gene expression data set from TCGA, where genes are annotated with Ensembl IDs. I used biomaRt to convert them to Entrez IDs with

library(biomaRt)

# Connect to the human gene dataset on Ensembl
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

# genesens holds the versioned Ensembl IDs (e.g. ENSG00000000003.13)
genes <- getBM(
  filters="ensembl_gene_id_version",
  attributes=c("ensembl_gene_id", "entrezgene"),
  values=genesens,
  mart=mart)

But all I get is a list with the mapped IDs, while I want to add a column with Entrez to the corresponding Ensembl ID. Any ideas of how I should modify the above code?

Thank you in advance!

EDIT: Note that the Ensembl IDs in the initial data frame have a dot (version) suffix.

rna-seq biomart • 1.7k views
ADD COMMENTlink modified 13 months ago • written 13 months ago by rin30
1

It sounds like you need to merge genes with your expression matrix. If the IDs in one table carry a version suffix and the IDs in the other do not, you can strip the suffix with gsub("\\.\\d+$", "", genes$ensembl_gene_id)
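
A minimal sketch of the suffix-stripping step, in base R only. The vector ids is made up for illustration (it reuses a few IDs posted later in this thread) and is not the OP's actual object.

```r
# Hypothetical versioned Ensembl IDs, in the style of a TCGA count table
ids <- c("ENSG00000000003.13", "ENSG00000000005.5", "ENSG00000000419.11")

# Remove the trailing ".<version>"; "$" anchors the match to the end of the ID
ids_nodot <- sub("\\.[0-9]+$", "", ids)

print(ids_nodot)
```

The stripped IDs can then serve as the key column when merging with a mapping table that uses unversioned IDs.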

ADD REPLYlink written 13 months ago by ejm32440
1

rina : You should take a look at @Mike Smith's answer here: A: Mapping Ensembl Gene IDs with dot suffix

ADD REPLYlink written 13 months ago by genomax71k

Looking at the NAs that came up after mapping to Entrez, I checked one at random (ENSG00000018607). It is linked to an Entrez ID, yet none was returned. Any ideas what the reason might be?
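
One way to narrow this down is to separate IDs that came back with an NA from IDs that are missing from the result altogether. The sketch below uses toy data only: queried and genes.entrez are stand-ins for the real input and the real getBM() output, and the two IDs starting with ENSG9999 are invented placeholders.

```r
# Toy input: IDs sent to the query (the last two are invented placeholders)
queried <- c("ENSG00000000003", "ENSG00000000005",
             "ENSG99999999991", "ENSG99999999992")

# Toy stand-in for a mapping result; NOT real biomaRt output
genes.entrez <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005", "ENSG99999999991"),
  entrezgene      = c(7105L, 64102L, NA),
  stringsAsFactors = FALSE
)

# IDs present in the result but without an Entrez ID
no_entrez <- genes.entrez$ensembl_gene_id[is.na(genes.entrez$entrezgene)]

# IDs that were queried but are absent from the result
not_returned <- setdiff(queried, genes.entrez$ensembl_gene_id)

print(no_entrez)
print(not_returned)
```

Knowing which of the two groups an ID falls into helps decide whether to look at the annotation itself or at how the query was built.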

This is the code I used

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes.entrez <- getBM(
  filters="ensembl_gene_id",
  attributes=c("ensembl_gene_id", "entrezgene"),
  values=genes.nodot,
  mart=mart)
ADD REPLYlink written 13 months ago by rin30
3
13 months ago by
arta540
Sweden
arta540 wrote:

Try this.

source("https://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
biocLite("clusterProfiler")
library(clusterProfiler)
library(org.Hs.eg.db)
gene.df <- bitr(gene.list, fromType = "ENSEMBL",
                        toType = c( "ENTREZID", "SYMBOL"),
                        OrgDb = org.Hs.eg.db)
ADD COMMENTlink modified 13 months ago • written 13 months ago by arta540
1

Running the code returns this message

'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(genesens2, fromType = "ENSEMBL", toType = c("ENTREZID",  :
  57.95% of input gene IDs are fail to map...

I know that not all IDs can be mapped, but is such a high percentage normal? Also, I am still unsure how to get the result into the data frame, because I need the expression matrix annotated with Entrez IDs for downstream steps. Since not all IDs will be mapped, I cannot simply add a column.
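
A column can still be added when some IDs are unmapped or map more than once, for example with match() in base R. This is only a sketch on toy data: expr and map are hypothetical stand-ins for the expression table and the mapping result, and the second Entrez value for ENSG00000000003 is invented to simulate a 1:many mapping.

```r
# Toy expression table (IDs and counts taken from examples in this thread)
expr <- data.frame(
  X1      = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  sampleA = c(2449L, 6L, 487L),
  stringsAsFactors = FALSE
)

# Toy mapping: one ID appears twice (999999 is invented), one ID is absent
map <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000003", "ENSG00000000005"),
  entrezgene      = c(7105L, 999999L, 64102L),
  stringsAsFactors = FALSE
)

# match() returns the position of the FIRST hit per ID and NA when there is none,
# so the row order and row count of expr are preserved
expr$entrezgene <- map$entrezgene[match(expr$X1, map$ensembl_gene_id)]

print(expr)
```

Taking the first hit is an arbitrary choice made here for brevity; whether to keep the first match, all matches, or drop ambiguous IDs depends on the downstream analysis.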

ADD REPLYlink written 13 months ago by rin30

Thanks for your help! Where is the bitr function from? It is not recognised from any of my installed packages.

ADD REPLYlink written 13 months ago by rin30

I have updated the code; I forgot to add the other package.

ADD REPLYlink written 13 months ago by arta540

Could you elaborate, please? Which package is bitr defined in? Does it deal with the gene builds appropriately?

ADD REPLYlink written 13 months ago by russhh4.7k
0
13 months ago by
cpad011211k
India
cpad011211k wrote:

Some example data from the genesens object would help, @rina.

ADD COMMENTlink written 13 months ago by cpad011211k

You are right.

Here are some: ENSG00000000003.13 ENSG00000000005.5 ENSG00000000419.11 ENSG00000000457.12

ADD REPLYlink written 13 months ago by rin30

With the example IDs and the OP's code, this is the result:

> genes
  ensembl_gene_id entrezgene
1 ENSG00000000005      64102

The output Ensembl gene IDs have no suffix. If you would like to merge the two data frames (your data and the results), you can merge them by ensembl_gene_id. If you could post a few lines from the data frame and from the results (with a few matching rows), that would be helpful.

If you want to add the gene symbol at the end, add 'hgnc_symbol' to the attributes list.

> genes
  ensembl_gene_id entrezgene hgnc_symbol
1 ENSG00000000005      64102        TNMD
ADD REPLYlink modified 13 months ago • written 13 months ago by cpad011211k

The data frame's first column has Ensembl IDs such as the following. The rest of the columns are raw counts of expression data.

 [1] "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11" "ENSG00000000971.14" "ENSG00000001036.12" "ENSG00000001084.9" 
[9] "ENSG00000001167.13"

The results I get after the mapping look like this.

ensembl_gene_id entrezgene
1 ENSG00000000005      64102
2 ENSG00000001561      22875
3 ENSG00000004478       2288
4 ENSG00000004799       5166
5 ENSG00000005022        292
6 ENSG00000005073       3207
ADD REPLYlink written 13 months ago by rin30
1

Well, there are ways to join the data frames, either with a simple workaround or with a fuzzy join. The simple workaround first (note: genes holds the example Ensembl IDs posted above, and genesens is the result from biomaRt):

> head(genes,3)
                  V1
1  ENSG00000000005.5
2 ENSG00000000419.11
3 ENSG00000000457.12
> library(stringr)
> genes$V2 = str_split_fixed(genes$V1, "\\.", 2)[,1]
> dplyr::left_join(genes, genesens, by=c("V2"="ensembl_gene_id"))
                  V1              V2 entrezgene
1  ENSG00000000005.5 ENSG00000000005      64102
2 ENSG00000000419.11 ENSG00000000419         NA
3 ENSG00000000457.12 ENSG00000000457         NA
4 ENSG00000000460.15 ENSG00000000460         NA
5 ENSG00000000938.11 ENSG00000000938         NA
6 ENSG00000000971.14 ENSG00000000971         NA
7 ENSG00000001036.12 ENSG00000001036         NA
8  ENSG00000001084.9 ENSG00000001084         NA
9 ENSG00000001167.13 ENSG00000001167         NA

With a fuzzy (regex) join, it would be:

> library(fuzzyjoin)
> regex_left_join(genes, genesens, by=c("V1"="ensembl_gene_id"))

                  V1 ensembl_gene_id entrezgene
1  ENSG00000000005.5 ENSG00000000005      64102
2 ENSG00000000419.11            <NA>         NA
3 ENSG00000000457.12            <NA>         NA
4 ENSG00000000460.15            <NA>         NA
5 ENSG00000000938.11            <NA>         NA
6 ENSG00000000971.14            <NA>         NA
7 ENSG00000001036.12            <NA>         NA
8  ENSG00000001084.9            <NA>         NA
9 ENSG00000001167.13            <NA>         NA
ADD REPLYlink modified 13 months ago • written 13 months ago by cpad011211k

Thank you so much for your help! The entrezgene column is an integer, and I thought left join can only be used on characters. Should I just convert it with the toString function? Excuse my very basic question, but I am just starting to work with R.

ADD REPLYlink written 13 months ago by rin30

Can you print the data structure of common columns between the two frames?

ADD REPLYlink written 13 months ago by cpad011211k

Expression matrix columns

                                   X1 TCGA-AA-3815-01A-01R-1022-07 TCGA-NH-A5IV-01A-42R-A37K-07
ENSG00000000003.13 ENSG00000000003.13                         2449                         4369
ENSG00000000005.5   ENSG00000000005.5                            6                           58
ENSG00000000419.11 ENSG00000000419.11                          487                         1168
ENSG00000000457.12 ENSG00000000457.12                          269                         1049
ENSG00000000460.15 ENSG00000000460.15                          177                          533
ENSG00000000938.11 ENSG00000000938.11                          331                          858

that I turned into

"ENSG00000000003"        "ENSG00000000005"        "ENSG00000000419"        "ENSG00000000457"        "ENSG00000000460"        "ENSG00000000938"        "ENSG00000000971"

by using nth(tstrsplit(genes, split ="\\."),n=1)

The biomaRt result is the following data frame

ensembl_gene_id entrezgene
1 ENSG00000000003       7105
2 ENSG00000000005      64102
3 ENSG00000000419       8813
4 ENSG00000000457      57147
5 ENSG00000000460      55732
6 ENSG00000000938       2268

Every column is "character" except entrezgene, which is an integer.

ADD REPLYlink modified 13 months ago • written 13 months ago by rin30

Then your merge is on the ensembl_gene_id column (from the result) and the X1 column from the data matrix. The type of the entrezgene column doesn't affect left_join, since it is not a join key.
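
To illustrate the point, here is a small sketch with base R's merge(), used as a stand-in for dplyr::left_join so that it runs without extra packages. The data frames dat and results are toy versions of the ones in this thread, and the column name count is made up.

```r
# Toy data matrix and mapping result (values taken from the posts above)
dat <- data.frame(
  X1    = c("ENSG00000000003", "ENSG00000000005"),
  count = c(2449L, 6L),
  stringsAsFactors = FALSE
)
results <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  entrezgene      = c(7105L, 64102L),
  stringsAsFactors = FALSE
)

# Only the key columns need compatible types; here both are character
stopifnot(is.character(dat$X1), is.character(results$ensembl_gene_id))

# all.x = TRUE keeps every row of dat, like a left join
joined <- merge(dat, results, by.x = "X1", by.y = "ensembl_gene_id", all.x = TRUE)

str(joined)
```

The entrezgene column stays an integer in the joined table, so no conversion with toString is needed.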

ADD REPLYlink written 13 months ago by cpad011211k

This is the reason I am confused when I get this message

Error in UseMethod("groups") : 
  no applicable method for 'groups' applied to an object of class "character"

And as the entrezgene column is the only one not being character I assumed this was the problem.

ADD REPLYlink written 13 months ago by rin30
1

Input head:

> head(dat)
                             X1 TCGA.AA.3815.01A.01R.1022.07 TCGA.NH.A5IV.01A.42R.A37K.07
ENSG00000000003 ENSG00000000003                         2449                         4369
ENSG00000000005 ENSG00000000005                            6                           58
ENSG00000000419 ENSG00000000419                          487                         1168
ENSG00000000457 ENSG00000000457                          269                         1049
ENSG00000000460 ENSG00000000460                          177                          533
ENSG00000000938 ENSG00000000938                          331                          858

results head:

> head(results)
  ensembl_gene_id entrezgene
1 ENSG00000000003       7105
2 ENSG00000000005      64102
3 ENSG00000000419       8813
4 ENSG00000000457      57147
5 ENSG00000000460      55732
6 ENSG00000000938       2268

data structure of results (ncbi entries are integers)

> str(results)
'data.frame':   6 obs. of  2 variables:
 $ ensembl_gene_id: chr  "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
 $ entrezgene     : int  7105 64102 8813 57147 55732 2268

output:

> dplyr::left_join(dat,results,by=c("X1"="ensembl_gene_id"))
               X1 TCGA.AA.3815.01A.01R.1022.07 TCGA.NH.A5IV.01A.42R.A37K.07 entrezgene
1 ENSG00000000003                         2449                         4369       7105
2 ENSG00000000005                            6                           58      64102
3 ENSG00000000419                          487                         1168       8813
4 ENSG00000000457                          269                         1049      57147
5 ENSG00000000460                          177                          533      55732
6 ENSG00000000938                          331                          858       2268

Check whether any loaded package conflicts with dplyr, and also check the structure of the common columns, e.g. str(dat$X1) and str(results$ensembl_gene_id) from the above example. Both must match.

ADD REPLYlink modified 13 months ago • written 13 months ago by cpad011211k

I was mistakenly putting the Ensembl ID column instead of the whole data frame as an argument to the left join function. It worked just fine now. Thanks for helping!
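
For anyone hitting the same message: a quick guard makes this kind of mistake visible early. The helper below is hypothetical (join_checked is not part of any package) and uses base R's merge() rather than dplyr, with toy data frames dat and results.

```r
# Hypothetical helper: refuse to join unless both inputs are data frames
join_checked <- function(x, y, by_x, by_y) {
  if (!is.data.frame(x) || !is.data.frame(y)) {
    stop("both inputs must be data frames, not single columns")
  }
  merge(x, y, by.x = by_x, by.y = by_y, all.x = TRUE)
}

# Toy inputs
dat <- data.frame(X1 = c("ENSG00000000003", "ENSG00000000005"),
                  count = c(2449L, 6L), stringsAsFactors = FALSE)
results <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
                      entrezgene = c(7105L, 64102L), stringsAsFactors = FALSE)

# Passing the whole data frame works
ok <- join_checked(dat, results, "X1", "ensembl_gene_id")

# Passing only the ID column (a character vector) is rejected with a clear error
bad <- try(join_checked(dat$X1, results, "X1", "ensembl_gene_id"), silent = TRUE)

print(ok)
print(class(bad))
```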

ADD REPLYlink written 13 months ago by rin30

rina : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they all work.

Note: I have moved @cpad0112's original comment to an answer to maintain the train of thought.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax71k