Question: Map Protein Accession Number to Uniprot identifier in R
jose.wo wrote (6 months ago):

Hello,

I have a list of proteins identified by what the author of the list calls a "Protein Accession Number". When I search for this number in the NCBI Protein database, I do find the protein.

For example, the protein accession number 29436380 returns the "MYH9 protein", and the number appears in the URL of the record: https://www.ncbi.nlm.nih.gov/protein/29436380

I have looked all over but couldn't find out what this identifier really is. I would like to use my list in R with other Bioconductor packages. For that I need to map it to UniProt IDs, but I can't find any package that does this, mainly because I don't know what this identifier is called.

Can anyone help me by either pointing me to a package that can do this mapping or telling me the name of this "accession number"?

Best, José.


It looks like a GI number (an NCBI identifier made up only of digits). More on them: https://www.ncbi.nlm.nih.gov/genbank/sequenceids/

I'm not entirely sure, but biomaRt might be useful to you; the documentation is here: https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html. You may need to experiment with the package to find out whether biomaRt has an "attribute" for GIs (as it does for ensembl_gene_id and uniprotswissprot, for example). If it does, you can query NCBI proteins using that list of GIs (I have a feeling that you might also need the sequence accession, but I might be wrong) and retrieve metadata or sequence information.
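
One way to look for such an attribute is to search the attribute table that biomaRt returns. The following is only a minimal sketch: `find_attributes` is a hypothetical helper (not part of biomaRt), it assumes the table has the `name` and `description` columns that `listAttributes()` produces, and the toy table merely stands in for a real mart.

```r
# Hypothetical helper: keep the rows of an attribute table whose name or
# description matches a pattern (case-insensitive).
find_attributes <- function(attr_table, pattern) {
  hit <- grepl(pattern, attr_table$name, ignore.case = TRUE) |
         grepl(pattern, attr_table$description, ignore.case = TRUE)
  attr_table[hit, , drop = FALSE]
}

# Toy stand-in for the result of biomaRt::listAttributes(mart).
toy_attrs <- data.frame(
  name        = c("ensembl_gene_id", "uniprotswissprot", "refseq_peptide"),
  description = c("Gene stable ID", "UniProtKB/Swiss-Prot ID", "RefSeq peptide ID"),
  stringsAsFactors = FALSE
)
uniprot_hits <- find_attributes(toy_attrs, "uniprot")
print(uniprot_hits$name)

# Against a live mart (needs network access and the biomaRt package):
# library(biomaRt)
# mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
# find_attributes(listAttributes(mart), "uniprot|refseq|protein")
```

If no attribute for GIs turns up, the GIs would first have to be converted to an identifier type that the mart does offer.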

Comment written 6 months ago by manaswwm

Thanks a lot, you pointed me in the right direction!!!

Reply written 6 months ago by jose.wo

jose.wo wrote (6 months ago):

Thanks a lot, manaswwm!!! You pointed me in the right direction. What I finally did was scrape the NCBI site to download the relevant information. I still have the problem that some records were removed or are obsolete, but there is no way around that other than getting the information manually.

Here is the code I used to retrieve the relevant information, in case it helps someone.

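library("stringr")  # str_extract(), str_trim()
library("dplyr")    # only needed when extractInfo() is called inside a dplyr pipe

# Download the GenPept page for one GI number and pull out the gene symbol,
# accession and GeneID with regular expressions.
# Returns "Symbol---Accession---GeneID"; a field that is not found becomes "NA".
extractInfo <- Vectorize(function(GInumber) {
    url <- paste0("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=", GInumber,
                  "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000")
    # Collapse the page into one string so the patterns can span line breaks.
    tempPage  <- paste(readLines(url, skipNul = TRUE, warn = FALSE), collapse = "")
    Accession <- str_trim(str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)"))
    # Lazy quantifiers stop at the first closing quote instead of the last one in range.
    Symbol    <- str_extract(tempPage, "(?<=gene=\").{1,20}?(?=\")")
    GeneID    <- str_extract(tempPage, "(?<=gov/gene/).{1,20}?(?=\">)")
    paste(Symbol, Accession, GeneID, sep = "---")
})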

You only need to pass the GI numbers to the "extractInfo" function and voilà!! I vectorized the function because I needed to use it in a dplyr pipe.
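
Since the function returns the three fields joined by "---", the result usually has to be split again before it can be used as columns. A minimal base R sketch of that step follows; `split_info` is a hypothetical helper and the input strings are made-up placeholders, not real NCBI records. Note that `paste()` turns a missing field into the text "NA", so those are converted back to real missing values at the end.

```r
# Split "Symbol---Accession---GeneID" strings into a data frame
# with one column per field.
split_info <- function(info) {
  parts <- do.call(rbind, strsplit(unname(info), "---", fixed = TRUE))
  data.frame(
    Symbol    = parts[, 1],
    Accession = trimws(parts[, 2]),  # drop padding around the accession
    GeneID    = parts[, 3],
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Made-up placeholder strings standing in for extractInfo() output.
example <- c("GENE1--- ACC00001 ---111", "NA---NA---NA")
info_table <- split_info(example)

# Fields that extractInfo() could not find arrive as the text "NA".
info_table[info_table == "NA"] <- NA
print(info_table)
```

Rows that come back entirely NA are the removed or obsolete records that still need a manual lookup.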

Thanks again, José.


Where possible, don't use GI numbers for analysis. They are deprecated for end users and should be replaced with accession numbers.
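
For an existing list of GIs, one possible route to accession numbers is NCBI's E-utilities `efetch` endpoint with `rettype=acc`, which returns accession.version strings as plain text. This is only a sketch under assumptions: `efetch_acc_url` and `gi_to_accession` are hypothetical helper names, and whether a given GI still resolves depends on the record.

```r
# Build an efetch URL that asks for the accession.version of one or more
# protein identifiers (comma-separated in a single request).
# Pass the IDs as character strings so large numbers are not reformatted.
efetch_acc_url <- function(ids) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
         "?db=protein&rettype=acc&retmode=text&id=",
         paste(as.character(ids), collapse = ","))
}

# Hypothetical wrapper: returns the lines of the response, or NA if the
# request fails (for example for removed records or without network access).
gi_to_accession <- function(ids) {
  tryCatch(readLines(efetch_acc_url(ids), warn = FALSE),
           error = function(e) NA_character_)
}

# The GI from the question; only the URL is built here, nothing is downloaded.
url <- efetch_acc_url("29436380")
print(url)

# To actually run the lookup (needs network access):
# gi_to_accession("29436380")
```

When looping over many IDs, keep in mind that NCBI limits requests without an API key to about three per second, so batching several IDs per call is gentler on the service than one call per GI.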

Reply written 6 months ago by genomax