Question: Formatting problem when converting from UniProt to Entrez Gene ID format
3.2 years ago, EverInEarnest wrote:

My code below reads in a file containing a subset of the DrugBank data and then maps the UniProt IDs of the drug targets to Entrez Gene IDs. The code runs and generates the output file, but the output is incorrect, and I am confused by the following issues:

My input file contains 12,370 values; however, the mapped Entrez Gene ID dataset contains 12,530 values. Given the simple R script below, I'm not sure why these additional values are being introduced. Inspecting the output file, I see that for some of the listed UniProt values, the value looks like an Entrez ID (i.e., a number with no character prefix), and the corresponding value assigned in the Entrez column is "NA". Inspecting the UniProt values in the input file, there are no such non-UniProt values present, so I'm not sure where these problematic values are originating.
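One possible source of both the extra rows and the number-like keys, offered as a guess rather than a diagnosis: before R 4.0, read.csv() defaulted to stringsAsFactors = TRUE, so a column of accessions could arrive as a factor, and a factor coerced to integer yields its internal codes instead of its labels. Whether the mapping call performs such a coercion is an assumption here. The sketch below uses a small made-up vector in place of column 2 of DrugBankData.csv and also counts repeated IDs, which are worth ruling out before mapping.

```r
# Hypothetical stand-in for column 2 of DrugBankData.csv (made-up subset).
ids_chr <- c("P00734", "O95169", "P00734", "Q9Y6K9")

# With stringsAsFactors = TRUE (the read.csv default before R 4.0),
# the column is stored as a factor: integer codes plus text labels.
ids_fct <- factor(ids_chr)

# Coercing a factor to integer exposes the codes, not the accessions.
codes <- as.integer(ids_fct)

# Converting explicitly to character keeps the accessions intact.
keys <- as.character(ids_fct)

# Repeated targets in the input are a second thing to check.
n_dup <- sum(duplicated(keys))
unique_keys <- unique(keys)

print(is.factor(ids_fct))
print(codes)
print(unique_keys)
```

If is.factor() returns TRUE for the real column, passing as.character(DrugBank_Data[,2]) (or reading the file with stringsAsFactors = FALSE) would be a cheap thing to try.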

Also, and concerningly, the data in the MappedData output file does not match the UniProt IDs from the input file. For example, the first UniProt ID listed in the input file is P00734, whereas the first UniProt value in the MappedData output file is O95169.
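A different row order in the output does not by itself mean the mapping is wrong, since a lookup service may return rows in its own order. The sketch below, using a toy table with assumed column names (UNIPROTKB, ENTREZ_GENE) and placeholder Entrez values, shows one way to put a mapping result back into input order with match().

```r
# Input IDs in their original order (first two are taken from the post).
input_ids <- c("P00734", "O95169", "Q9Y6K9")

# Hypothetical mapping result, returned in a different order.
# Column names and ENTREZ_GENE values are placeholders, not real output.
mapped <- data.frame(
  UNIPROTKB   = c("O95169", "Q9Y6K9", "P00734"),
  ENTREZ_GENE = c("id_A", "id_B", "id_C"),
  stringsAsFactors = FALSE
)

# match() gives, for each input ID, the row of its first hit in `mapped`
# (NA if the ID was not mapped at all).
idx <- match(input_ids, mapped$UNIPROTKB)

aligned <- data.frame(
  UNIPROTKB   = input_ids,
  ENTREZ_GENE = mapped$ENTREZ_GENE[idx],
  stringsAsFactors = FALSE
)

print(aligned)
```

Note that match() keeps only the first hit per ID; if one UniProt ID maps to several genes, merge() on the ID column would retain all of them.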

If anyone can provide insight into what is wrong with my script below such that the UniProt IDs are not being correctly mapped to Entrez Gene format, I will greatly appreciate your guidance.

# Note: for some reason, the left parenthesis of this library() call isn't showing up in this post, but it is present in my actual code

DrugBank_Data <- read.csv("DrugBankData.csv")

TargetID_UniProt <- DrugBank_Data[,2]

# Stereotyped call that is always used to create an object
up <-

MappedData <- select(up, TargetID_UniProt, "ENTREZ_GENE")

write.csv(MappedData, "MappedData.csv")
3.2 years ago, Kevin Blighe (Republic of Ireland) wrote:

I have never used the package, so I don't know whether there are any parameters in the select() function that may help. There is a previous thread here: Extracting domain list for proteins Using in R

Otherwise, may I recommend the use of biomaRt? I have used this in the past to convert between ENTREZ and RefSeq Official Gene Symbols. For the UniProt IDs in your data, I think that the code would be something like:


library(biomaRt)

uniprot <- useMart("unimart", dataset="uniprot")

# 'filters' is the full argument name in getBM()
annots <- getBM(mart=uniprot, attributes=c("uniprot_swissprot", "ensembl_gene_id"), filters="uniprot_swissprot", values=TargetID_UniProt, uniqueRows=TRUE)

The uniqueRows parameter is important, and there are also other attributes that you can have returned, such as gene_biotype and external_gene_name.
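As far as I understand it, uniqueRows only drops rows that are identical across all returned columns, so an ID that maps to more than one gene still produces several rows. The sketch below uses a toy table in place of the getBM() result (the gene values are placeholders) to show how to spot such one-to-many IDs and, if one row per ID is wanted, collapse them.

```r
# Toy stand-in for the table returned by getBM(); gene IDs are placeholders.
annots <- data.frame(
  uniprot_swissprot = c("P00734", "O95169", "O95169", "O95169"),
  ensembl_gene_id   = c("gene_1", "gene_2", "gene_2", "gene_3"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows (similar in spirit to uniqueRows = TRUE).
annots_unique <- unique(annots)

# IDs that still have more than one row map to more than one gene.
hits  <- table(annots_unique$uniprot_swissprot)
multi <- names(hits)[hits > 1]

# Optionally collapse to one row per UniProt ID, joining genes with ";".
collapsed <- aggregate(
  ensembl_gene_id ~ uniprot_swissprot,
  data = annots_unique,
  FUN  = function(x) paste(x, collapse = ";")
)

print(multi)
print(collapsed)
```

Comparing nrow(annots_unique) with the number of unique input IDs is a quick way to see how much of a row-count difference comes from one-to-many mappings.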

Finally, there is a useful tutorial for using biomaRt, including with uniprot, here:

Hope that this helps



Many thanks for your detailed response, Kevin; I have also been exploring another route, and will consider your suggestions along with my current efforts; I'll plan to post what works.

written 3.2 years ago by EverInEarnest

Okay, great! - Good luck

written 3.2 years ago by Kevin Blighe