Hugo_Symbol to Entrez ID
11 months ago

Hello,

I have a Myeloid-Acute Myeloid Leukemia (AML) RNA-seq data file, data_mrna_seq_rpkm.csv. The file has Hugo symbols for all 22,844 genes but not their Entrez IDs. I tried two methods in R, 1) the org.Hs.eg.db mapIds method and 2) biomaRt, but could obtain Entrez IDs for only 16,569 of the Hugo symbols; the remaining 6,275 come back as NA. How can I replace these NA values with the corresponding Entrez IDs? Please guide me on how to get Entrez IDs for all 22,844 genes (Hugo symbols).

1. Using the org.Hs.eg.db mapIds method

library(org.Hs.eg.db)

# Read your CSV data
data <- read.csv("data_mrna_seq_rpkm.csv", stringsAsFactors=FALSE)

# Get the mapping
entrez_ids <- mapIds(org.Hs.eg.db, 
                     keys = data$Hugo_Symbol, 
                     column = "ENTREZID", 
                     keytype = "SYMBOL", 
                     multiVals = "first")

# Add the Entrez IDs to your data frame
data$Entrez_Gene_Id <- entrez_ids

write.csv(data, "updated_data_mrna_seq_rpkm.csv", row.names = FALSE)

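2. Using biomaRt

# Install (biomaRt is a Bioconductor package, not a CRAN package) and load libraries
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)
library(readr)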

# Read the CSV file
df <- read_csv("updated_data_mrna_seq_rpkm.csv")

# Connect to the Ensembl BioMart database
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Get the Hugo Symbols with NA Entrez_Gene_Id
hugo_symbols_with_na <- df$Hugo_Symbol[is.na(df$Entrez_Gene_Id)]

listAttributes(mart)

# Fetch their Entrez IDs from BioMart
genes <- getBM(attributes = c('hgnc_symbol', 'entrezgene_id'), 
               filters = 'hgnc_symbol', 
               values = hugo_symbols_with_na, 
               mart = mart)


# Replace NA values in the original dataframe with fetched Entrez IDs
# (the getBM() column is named 'entrezgene_id'; seq_len() is safe if no rows come back)
for (i in seq_len(nrow(genes))) {
  mask <- df$Hugo_Symbol == genes$hgnc_symbol[i] & is.na(df$Entrez_Gene_Id)
  df$Entrez_Gene_Id[mask] <- genes$entrezgene_id[i]
}

# Save the updated dataframe back to CSV
write_csv(df, "updated_data_mrna_seq_rpkm_updated.csv")
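
The row-by-row loop above can also be written as a single vectorised lookup with match(). This is only a minimal sketch on toy data: the two small data frames are hypothetical stand-ins for df and the getBM() result (a real query may not return these symbols at all), and the IDs are the ones quoted elsewhere in this thread.

```r
# Toy stand-ins for `df` and the getBM() result `genes` (hypothetical data)
df <- data.frame(
  Hugo_Symbol    = c("TSPAN6", "C19orf60", "TCEB3", "UNKNOWN1"),
  Entrez_Gene_Id = c(7105L, NA, NA, NA)
)
genes <- data.frame(
  hgnc_symbol   = c("C19orf60", "TCEB3"),
  entrezgene_id = c(55049L, 6924L)
)

# For every row of df, find the position of its symbol in genes (NA if absent)
idx <- match(df$Hugo_Symbol, genes$hgnc_symbol)

# Fill only rows that are still NA and for which a match was found
fill <- is.na(df$Entrez_Gene_Id) & !is.na(idx)
df$Entrez_Gene_Id[fill] <- genes$entrezgene_id[idx[fill]]
```

Like the loop, match() takes the first hit when a symbol appears more than once in genes, and symbols with no hit simply stay NA.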

Can you provide some examples of Hugo symbols that you are unable to convert?


Yes, sure. Examples of gene symbols that fail to convert are BZRAP1, C19orf60, TCEB3 and so on.


Using Entrez Direct:

$ esearch -db gene -query "TSPAN6 [gene] AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element Id
7105

$ esearch -db gene -query "C19orf60 [gene] AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element Id
55049

$ esearch -db gene -query "TCEB3  [gene] AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element Id
6924
11 months ago

It is likely that many of these genes have no corresponding Entrez Gene ID. Entrez (NCBI), HUGO, and Ensembl have different 'rules' about how to prove that a gene exists.

What you could try is to build an annotation table via org.Hs.eg.db, which may provide more information. For example:

# AnnotationDbi:: avoids a clash with dplyr::select if dplyr is loaded
annot.table <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = data$Hugo_Symbol,
  columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL', 'GENENAME', 'REFSEQ'),
  keytype = 'SYMBOL')
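
If some of the unmapped symbols turn out to be older names rather than genes without an Entrez record (the EDirect lookups earlier in this thread do return IDs for C19orf60 and TCEB3), a second pass over an alias table may recover them. The sketch below shows only the two-pass logic. The function lookup_entrez and both toy tables are hypothetical; with org.Hs.eg.db, comparable named vectors could presumably be built with mapIds() using keytype = 'SYMBOL' and keytype = 'ALIAS'.

```r
# Two-pass lookup: current symbols first, then aliases / previous symbols.
# `primary` and `alias` are named character vectors (symbol -> Entrez ID).
lookup_entrez <- function(symbols, primary, alias) {
  ids <- unname(primary[symbols])              # pass 1: current symbol
  miss <- is.na(ids)                           # symbols still unmapped
  ids[miss] <- unname(alias[symbols[miss]])    # pass 2: alias table
  ids
}

# Toy tables (hypothetical; the IDs are those reported by EDirect in this thread)
primary <- c(TSPAN6 = "7105")
alias   <- c(C19orf60 = "55049", TCEB3 = "6924")

ids <- lookup_entrez(c("TSPAN6", "C19orf60", "TCEB3", "NOT_A_GENE"),
                     primary, alias)
```

Symbols found in neither table stay NA. Because one alias can point to more than one gene, hits from the second pass are worth checking by hand.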

Kevin
