Question

Hugo_Symbol to Entrez ID

7 months ago

shakyaram079 • 0

Hello,

I have a Myeloid-Acute Myeloid Leukemia (AML) RNA-seq data file, data_mrna_seq_rpkm.csv. The file has Hugo symbols for all 22,844 genes but no Entrez IDs. I tried two methods in R, 1) mapIds from org.Hs.eg.db and 2) biomaRt, and was able to get Entrez IDs for only 16,569 genes; the remaining 6,275 Hugo symbols returned NA. How can I replace the NA values with the corresponding Entrez IDs? Please guide me on getting Entrez IDs for all 22,844 genes (Hugo symbols).

1. Using the mapIds method from org.Hs.eg.db

library(org.Hs.eg.db)

# Read your CSV data
data <- read.csv("data_mrna_seq_rpkm.csv", stringsAsFactors=FALSE)

# Get the mapping
entrez_ids <- mapIds(org.Hs.eg.db, 
                     keys = data$Hugo_Symbol, 
                     column = "ENTREZID", 
                     keytype = "SYMBOL", 
                     multiVals = "first")

# Add the Entrez IDs to your data frame
data$Entrez_Gene_Id <- entrez_ids

write.csv(data, "updated_data_mrna_seq_rpkm.csv", row.names = FALSE)

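2. Using biomaRt

# Install (from Bioconductor, not CRAN) and load the necessary libraries
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)
library(readr)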

# Read the CSV file
df <- read_csv("updated_data_mrna_seq_rpkm.csv")

# Connect to the Ensembl BioMart database
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Get the Hugo Symbols with NA Entrez_Gene_Id
hugo_symbols_with_na <- df$Hugo_Symbol[is.na(df$Entrez_Gene_Id)]

listAttributes(mart)

# Fetch their Entrez IDs from BioMart
genes <- getBM(attributes = c('hgnc_symbol', 'entrezgene_id'), 
               filters = 'hgnc_symbol', 
               values = hugo_symbols_with_na, 
               mart = mart)


# Replace NA values in the original dataframe with fetched Entrez IDs
# (the column returned by getBM is named entrezgene_id)
for (i in seq_len(nrow(genes))) {
  mask <- df$Hugo_Symbol == genes$hgnc_symbol[i] & is.na(df$Entrez_Gene_Id)
  df$Entrez_Gene_Id[mask] <- genes$entrezgene_id[i]
}

# Save the updated dataframe back to CSV
write_csv(df, "updated_data_mrna_seq_rpkm_updated.csv")

entrez R hugo-symbol org.hs.eg.db biomart • 794 views

updated 6 months ago by Ram • written 7 months ago by shakyaram079

Can you provide some examples of HUGO symbols you are unable to convert?

7 months ago by GenoMax

Yes, sure. Examples include BZRAP1, C19orf60, and TCEB3.

7 months ago by shakyaram079

Using EntrezDirect (LINK):

$ esearch -db gene -query "TSPAN6 [gene] AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element Id
7105

$ esearch -db gene -query "C19orf60 [gene] AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element Id
55049

$ esearch -db gene -query "TCEB3  [gene] AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element Id
6924

7 months ago by GenoMax

score 0 · Answer 1 · 2023-09-24

It is likely that many of these genes have no corresponding Entrez Gene ID. Entrez (NCBI), HUGO, and Ensembl have different 'rules' about how to prove that a gene exists.

What you could try is to build an annotation table via org.Hs.eg.db, which may provide more information. For example:

annot.table <- select(
  org.Hs.eg.db,
  keys = data$Hugo_Symbol,
  columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL', 'GENENAME', 'REFSEQ'),
  keytype = 'SYMBOL')

Kevin
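
A possible follow-up, offered as a sketch rather than a confirmed fix: some symbols that return NA may be older names that org.Hs.eg.db only lists as aliases, so a second pass with keytype "ALIAS" might recover part of them. The helper name fill_missing_ids and the toy vectors below are made up for illustration; the IDs in the toy data are the ones reported by the EntrezDirect output earlier in this thread. The org.Hs.eg.db step only runs if that package is installed.

```r
# Hypothetical helper: fill NA IDs from a named lookup vector (symbol -> ID).
fill_missing_ids <- function(symbols, ids, lookup) {
  missing <- is.na(ids)
  # Indexing a named vector by a name it does not contain returns NA,
  # so symbols without a match simply stay NA.
  ids[missing] <- unname(lookup[symbols[missing]])
  ids
}

# Toy data (assumption: a first pass with keytype = "SYMBOL" left three NAs)
symbols <- c("TSPAN6", "C19orf60", "TCEB3", "NOT_A_GENE")
ids     <- c("7105", NA, NA, NA)
lookup  <- c(C19orf60 = "55049", TCEB3 = "6924")

filled <- fill_missing_ids(symbols, ids, lookup)
print(filled)

# With org.Hs.eg.db installed, a lookup could be built from the ALIAS keytype.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  alias_lookup <- tryCatch(
    AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                          keys = symbols[is.na(ids)],
                          column = "ENTREZID",
                          keytype = "ALIAS",
                          multiVals = "first"),
    # mapIds raises an error when none of the keys are valid
    error = function(e) character(0))
  print(fill_missing_ids(symbols, ids, alias_lookup))
}
```

One caveat: an alias can point to more than one gene, so multiVals = "first" may pick the wrong one. It is worth inspecting ambiguous aliases (for example with multiVals = "list") before writing the IDs back to the file.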