Question: Map Protein Accession Number to Uniprot identifier in R
6 months ago by
jose.wo0 wrote:


I have a list of proteins identified by what the author of the list called a "Protein Accession Number". When I look up this number in the NCBI protein search, I do find the protein.

For example, the protein accession number 29436380 returns the "MYH9 protein", and the number appears in the address of the result page:

I have looked all over but couldn't find out what this identifier really is. I would like to use my list in R with other Bioconductor packages, and for that I need to match it with UniProt IDs. I can't find any package to do this, mainly because I don't know what this identifier is called.

Can anyone help me by either pointing me to a package that can do this mapping or telling me the name of this "accession number"?

Best, José.

modified 6 months ago • written 6 months ago by jose.wo0

It looks like a GI accession (accessions made up only of digits) - more on them :

Not entirely sure, but biomaRt might be useful to you, documentation for this : In your case you might need to explore the package and check whether biomaRt has an "attribute" for GIs (just as there is for ensembl_gene_id and uniprotswissprot, for example). If so, you can query NCBI proteins with your list of GIs (I have a feeling that you might also need the sequence accession, but I might be wrong) and get metadata or sequence info.
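One way to "play around" with the attribute list is sketched below. It is only a sketch under assumptions: it assumes the biomaRt package is installed, that the human Ensembl dataset is the relevant one, and it uses a hypothetical helper name, filter_attributes, that is not part of biomaRt. Whether a GI attribute actually exists has to be read off the output; the code does not presume one.

```r
# Keep only the attribute rows whose name matches a pattern.
# Pure helper, no network access; `filter_attributes` is a hypothetical name.
filter_attributes <- function(attrs, pattern) {
  keep <- grepl(pattern, attrs$name, ignore.case = TRUE)
  attrs[keep, c("name", "description"), drop = FALSE]
}

# Network-dependent part: runs only in an interactive session with biomaRt installed.
# The dataset choice (human) is an assumption; adjust it to your organism.
if (interactive() && requireNamespace("biomaRt", quietly = TRUE)) {
  mart  <- biomaRt::useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
  attrs <- biomaRt::listAttributes(mart)  # data frame with name, description, page
  print(filter_attributes(attrs, "uniprot|refseq|protein_id"))
}
```

Any attribute name found this way can then be passed to getBM() as a filter or an attribute.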

written 6 months ago by manaswwm130

Thanks a lot, you pointed me in the right direction!!!

written 6 months ago by jose.wo0
6 months ago by
jose.wo0 wrote:

Thanks a lot manaswwm!!! You pointed me in the right direction. What I finally did was scrape the NCBI site to download the relevant information. I still have the problem that some records were removed or are obsolete, but there is no way round that other than getting the information manually.

Here is the code I used to retrieve the relevant information, in case it helps someone.


library(stringr)  # provides str_extract()

# Fetch the GenPept page for one GI number and extract symbol, accession and gene ID.
extractInfo <- Vectorize(function(GInumber){
    # The base URL (first argument to paste) is empty here; supply it before running.
    tempPage  <- readLines(paste("", GInumber, "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000", sep = ""), skipNul = TRUE)
    tempPage  <- base::paste(tempPage, collapse = "")  # flatten the page into one string
    Accession <- str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)")
    Symbol    <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
    GeneID    <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
    out       <- paste(Symbol, Accession, GeneID, sep = "---")
    return(out)
})

You only need to pass the GI accession number to the "extractInfo" function and voilà!! I vectorized the function because I needed to use it inside a dplyr pipeline.
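The extraction step can be tried offline before hitting the NCBI site. The sketch below uses base R only and a made-up stand-in for a flattened page (the string, the accession XX000000, the symbol ABC1 and the gene ID 0000 are all invented for illustration, not a real record). The helper name extract_between is hypothetical; it captures the text between two literal markers, which is the same idea as the lookbehind/lookahead patterns above.

```r
# Return the text between two literal markers, or NA if either marker is missing.
# \\Q...\\E makes the markers literal, so regex metacharacters in them are harmless.
extract_between <- function(text, left, right) {
  pattern <- paste0("\\Q", left, "\\E(.*?)\\Q", right, "\\E")  # lazy: shortest span
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else trimws(m[2])
}

# Made-up stand-in for a flattened GenPept HTML page (not a real record).
page <- 'LOCUS XX000000 ACCESSION   XX000000 VERSION XX000000.1 /gene="ABC1" <a href="https://www.ncbi.nlm.nih.gov/gene/0000">0000</a>'

accession <- extract_between(page, "ACCESSION", "VERSION")
symbol    <- extract_between(page, 'gene="', '"')
gene_id   <- extract_between(page, "gov/gene/", '">')
out       <- paste(symbol, accession, gene_id, sep = "---")
```

Returning NA for a missing marker makes removed or obsolete records easy to spot afterwards with is.na().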

Thanks again, José.

written 6 months ago by jose.wo0

Where possible, don't use GI numbers for analysis. They are deprecated for end-user use and should be replaced with accession numbers.
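A possible way to swap GI numbers for accessions without scraping HTML is sketched below. It rests on several assumptions: that the rentrez package is installed, that NCBI E-utilities still resolves the GI numbers in your list, and that requesting rettype = "acc" returns one accession.version per line. The helper name strip_version is hypothetical.

```r
# Strip the version suffix (".1", ".2", ...) from accession.version strings.
# `strip_version` is a hypothetical helper name used only in this sketch.
strip_version <- function(acc) sub("\\.[0-9]+$", "", trimws(acc))

# Network-dependent part: runs only in an interactive session with rentrez installed.
if (interactive() && requireNamespace("rentrez", quietly = TRUE)) {
  gi  <- "29436380"  # the GI number from the question
  raw <- rentrez::entrez_fetch(db = "protein", id = gi, rettype = "acc")
  acc <- strsplit(raw, "\n", fixed = TRUE)[[1]]
  print(strip_version(acc[nzchar(acc)]))  # drop empty lines, then drop versions
}
```

The resulting accessions are the identifiers to carry forward into any later mapping step.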

written 6 months ago by genomax91k