Question

Map Protein Accession Number to Uniprot identifier in R

5.2 years ago

jose.wo ▴ 10

Hello,

I have a list of proteins identified by what the author of the list called a "Protein Accession Number". When I search for this number in the NCBI protein database, I do find the protein.

For example, the protein accession number 29436380 returns the "MYH9 protein", and the number also appears in the page URL: https://www.ncbi.nlm.nih.gov/protein/29436380

I have looked all over but couldn't find out what this identifier really is. I would like to use my list in R with other Bioconductor packages, and for that I need to match it to UniProt IDs. I can't find a package that does this, mainly because I don't know what this kind of identifier is called.

Can anyone help me by either pointing me to a package that can do this mapping or telling me the name of this "accession number"?

Best, José.

R protein accession number uniprot nbci • 3.6k views

It looks like a GI number (an NCBI identifier made up of digits only). More on them: https://www.ncbi.nlm.nih.gov/genbank/sequenceids/

I'm not entirely sure, but biomaRt might be useful to you. Documentation: https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html. You will probably need to play around with the package and check whether biomaRt has an "attribute" for GIs (just as it has for ensembl_gene_id or uniprotswissprot, for example). If it does, you can query with your list of GIs and get metadata or sequence information back. (I have a feeling you might also need the sequence accession, but I might be wrong.)
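To make the "find out if there is an attribute" step concrete, here is a minimal sketch of how one could search the biomaRt attribute table for identifier columns. Whether a GI attribute exists at all is not confirmed here. `find_id_attributes` and `list_candidate_attributes` are hypothetical helper names, and the human dataset `hsapiens_gene_ensembl` is an assumption about the species.

```r
# Sketch only: filter a biomaRt attribute table for identifier-like columns.
# `find_id_attributes` is a hypothetical helper, not part of biomaRt.
# `attrs` is expected to look like the output of biomaRt::listAttributes(),
# i.e. a data frame with at least `name` and `description` columns.
find_id_attributes <- function(attrs, pattern = "uniprot|refseq|protein_id") {
  hits <- grepl(pattern, attrs$name, ignore.case = TRUE)
  attrs[hits, c("name", "description"), drop = FALSE]
}

# Needs the biomaRt package and network access, so it is wrapped in a
# function instead of being run on load. The dataset is an assumption (human).
list_candidate_attributes <- function(dataset = "hsapiens_gene_ensembl") {
  mart <- biomaRt::useEnsembl(biomart = "genes", dataset = dataset)
  find_id_attributes(biomaRt::listAttributes(mart))
}
```

Calling `list_candidate_attributes()` should print the matching attribute names with their descriptions. If nothing GI-like shows up, biomaRt alone is probably not enough for this mapping.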

5.2 years ago by manaswwm ▴ 570

Thanks a lot, you pointed me in the right direction!!!

5.2 years ago by jose.wo ▴ 10

score 2 · Accepted Answer · 2020-04-17

Thanks a lot, manaswwm!!! You pointed me in the right direction. What I finally did was scrape the NCBI site to download the relevant information. I still have the problem that some records have been removed or are obsolete, but there is no way around that other than getting the information manually.
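For the removed or obsolete records, one option is to keep the batch running and collect the failures for manual follow-up. This is only a sketch: `safely_fetch` and `failed_gis` are hypothetical helpers, and `fetch` stands for any function that looks up a single GI number (such as the scraping function in this answer) and throws an error when the page cannot be read.

```r
# Hypothetical helper: call `fetch` on each GI number and return NA instead
# of stopping the whole run when a single lookup raises an error.
safely_fetch <- function(gi_numbers, fetch) {
  vapply(gi_numbers, function(gi) {
    tryCatch(
      as.character(fetch(gi))[1],
      error = function(e) NA_character_
    )
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: the GI numbers whose lookup failed and that still
# need to be checked by hand.
failed_gis <- function(gi_numbers, results) {
  gi_numbers[is.na(results)]
}
```

Note that this only catches lookups that raise an error. A page that loads but no longer contains the expected fields would still need a separate check.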

Here is the code I used to retrieve the relevant information, in case it helps someone.

library("rvest")    # not called by extractInfo itself
library("stringr")  # str_extract()
library("dplyr")    # for the pipe that extractInfo is used in

# Fetch the GenPept page for a GI number and pull out the gene symbol,
# accession and Gene ID, returned as "Symbol---Accession---GeneID".
extractInfo <- Vectorize(function(GInumber){
    url       <- paste0("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=", GInumber, "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000")
    tempPage  <- readLines(url, skipNul = TRUE)
    tempPage  <- base::paste(tempPage, collapse = "")   # whole page as one string
    Accession <- str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)")
    Symbol    <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
    GeneID    <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
    out       <- paste(Symbol, Accession, GeneID, sep = "---")
    return(out)
})

You only need to pass the GI number to the "extractInfo" function and voilà!! I wrapped the function in Vectorize() because I needed to call it on a whole column inside a dplyr pipe.
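Since the function returns one "Symbol---Accession---GeneID" string per GI, a small helper can split the result back into columns. This is a minimal sketch: `split_info` is a hypothetical name, and the trimming step assumes the extracted accession may carry surrounding whitespace from the page.

```r
# Hypothetical helper: turn "Symbol---Accession---GeneID" strings into a
# data frame with one column per field.
split_info <- function(info) {
  info  <- unname(info)   # Vectorize() may name the result after the input
  parts <- strsplit(info, "---", fixed = TRUE)
  mat   <- do.call(rbind, lapply(parts, function(p) trimws(p[1:3])))
  # paste() turns a missing field into the literal string "NA"; restore it
  mat[!is.na(mat) & mat == "NA"] <- NA_character_
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- c("Symbol", "Accession", "GeneID")
  out
}

# Usage (needs network access; `proteins` is an assumed data frame with a
# character column `GI`):
# proteins <- cbind(proteins, split_info(extractInfo(proteins$GI)))
```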

Thanks again, José.