Question

Formatting problem when converting from UniProt to Entrez Gene ID format

6.7 years ago

EverInEarnest ▴ 40

My code below reads in a file of a subset of the DrugBank data, and then calls UniProt.ws() to map the UniProt IDs of the drug targets to Entrez Gene ID format. This code runs and generates the output file, but the output is incorrect, and I am confused by the following issues:

My input file contains 12,370 values; however, the mapped Entrez Gene ID dataset contains 12,530 values. Given the simple R script below, I'm not sure why these additional values are being introduced. Inspecting the output file, I see that for some of the listed UniProt values, the value looks like an Entrez ID (i.e., a number with no character prefix), and the corresponding value assigned in the Entrez column is "NA". Inspecting the UniProt values in the input file, there are no such non-UniProt values present, so I'm not sure where these problematic values are originating.

Also, and concerningly, the data in the MappedData output file does not match the UniProt IDs from the input file. For example, the first UniProt ID listed in the input file is P00734, whereas the first UniProt value in the MappedData output file is O95169.

If anyone can provide insight into what is wrong with my script below such that the UniProt IDs are not being correctly mapped to Entrez Gene format, I will greatly appreciate your guidance.

library(UniProt.ws)

DrugBank_Data <- read.csv("DrugBankData.csv")

TargetID_UniProt <- DrugBank_Data[,2]

# Stereotyped call that is always used to create a UniProt.ws object
up <- UniProt.ws(taxId=9606)

MappedData <- select(up, TargetID_UniProt, "ENTREZ_GENE")

write.csv(MappedData, "MappedData.csv")

R conversion UniProt Entrez • 2.1k views

updated 6.6 years ago by Kevin Blighe 87k • written 6.7 years ago by EverInEarnest ▴ 40

score 0 · Answer 1 · 2017-09-14

I have never used the UniProt.ws package, so I don't know whether there are any parameters in the select() function that may help. There is a previous thread here: Extracting domain list for proteins Using UniProt.ws in R
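Before switching packages, it may be worth checking whether the ID column was read in as a factor. In R versions before 4.0, read.csv() converts strings to factors by default, and a factor that is coerced to its integer codes somewhere downstream would show up as bare numbers. Whether this is what actually happens inside select() is an assumption I have not verified; the toy vector below only illustrates the coercion itself, using the two accessions mentioned in the question.

```r
# Toy stand-in for the ID column of DrugBankData.csv (accessions from the post)
ids_factor <- factor(c("P00734", "O95169", "P00734"))

# Coercing a factor to integer yields the internal level codes, not the labels
codes <- as.integer(ids_factor)

# Converting to character first preserves the accession strings
ids_char <- as.character(ids_factor)

print(codes)
print(ids_char)
```

If this turns out to be the cause, reading the file with read.csv("DrugBankData.csv", stringsAsFactors=FALSE), or wrapping the column in as.character() before passing it on, should avoid it.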

Otherwise, may I recommend the use of biomaRt? I have used this in the past to convert between ENTREZ and RefSeq Official Gene Symbols. For the UniProt IDs in your data, I think that the code would be something like:

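library(biomaRt)

# Connect to the human dataset of the Ensembl gene mart (the attributes used below belong to this mart, not to a separate UniProt mart)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

# Attribute and filter names follow recent Ensembl releases and may change; check listAttributes(ensembl) and listFilters(ensembl) if the query fails
# as.character() guards against the ID column having been read in as a factor
annots <- getBM(mart=ensembl, attributes=c("uniprotswissprot", "entrezgene_id", "ensembl_gene_id"), filters="uniprotswissprot", values=as.character(TargetID_UniProt), uniqueRows=TRUE)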

The uniqueRows parameter is important because it removes identical duplicate rows from the result. There are also other attributes that you can have returned, such as gene_biotype and external_gene_name.
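Note that even without duplicate rows, a mapping result can have more rows than the input (one UniProt ID can map to several genes) and usually does not come back in input order. Here is a minimal base R sketch of how the result could be lined up with the input again. The mapping table is a hypothetical stand-in for a getBM() or select() result, and the gene values are made-up placeholders, not real IDs.

```r
# Input IDs in their original order (accessions taken from the post)
input_ids <- c("P00734", "O95169", "P00734")

# Hypothetical mapping result: rows arrive in a different order and
# one ID maps to two genes. The gene values are made-up placeholders.
mapping <- data.frame(
  uniprot = c("O95169", "P00734", "P00734"),
  gene    = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Keep one row per UniProt ID (the first hit), then look up each input in order
first_hit <- mapping[!duplicated(mapping$uniprot), ]
aligned <- data.frame(
  uniprot = input_ids,
  gene    = first_hit$gene[match(input_ids, first_hit$uniprot)],
  stringsAsFactors = FALSE
)

print(nrow(mapping))
print(aligned)
```

Keeping only the first hit is just one possible choice; if every mapped gene matters, merge(data.frame(uniprot=input_ids), mapping, all.x=TRUE) keeps all of them, at the cost of more rows than inputs.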

Finally, there is a useful tutorial for using biomaRt, including with uniprot, here: http://www.ensembl.org/info/data/biomart/biomart_r_package.html

Hope that this helps

Kevin