Map Protein Accession Number to Uniprot identifier in R
4.0 years ago
jose.wo ▴ 10

Hello,

I have a list of proteins identified by what the list's author calls a "Protein Accession Number". When I search for one of these numbers in the NCBI protein database, I do find the protein.

For example, the protein accession number 29436380 returns the "MYH9 protein", and the number appears in the page's URL: https://www.ncbi.nlm.nih.gov/protein/29436380

I have looked everywhere but couldn't find out what this identifier really is. I would like to use my list in R with other Bioconductor packages. For that I need to match it to UniProt IDs, but I can't find any package that does this, mainly because I don't know what this kind of identifier is called.

Can anyone help me by either pointing me to a package that can do this mapping or telling me the name of this "accession number"?

Best, José.

R protein accession number uniprot nbci • 2.7k views

It looks like a GI number (an identifier made up only of digits). More on them: https://www.ncbi.nlm.nih.gov/genbank/sequenceids/

I'm not entirely sure, but biomaRt might be useful to you. Documentation: https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html. You may need to play around with the package to find out whether biomaRt has an "attribute" for GIs (just as it has ensembl_gene_id and uniprotswissprot, for example). If it does, you can query with your list of GIs and retrieve metadata or sequence information. I have a feeling that you might also need the sequence accession, but I might be wrong.
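To check which attributes a mart actually offers, something along these lines might work. It is only a sketch: the helper names `find_attributes` and `list_candidate_attributes` are made up for illustration, the human Ensembl dataset is an assumption about the species, and nothing here guarantees that a GI attribute exists.

```r
# Filter a biomaRt attribute table (columns `name` and `description`)
# by a case-insensitive pattern. Hypothetical helper, not part of biomaRt.
find_attributes <- function(attr_table, pattern) {
  hit <- grepl(pattern, attr_table$name, ignore.case = TRUE) |
         grepl(pattern, attr_table$description, ignore.case = TRUE)
  attr_table[hit, , drop = FALSE]
}

# Needs the biomaRt package and network access, so it is wrapped in a
# function and only runs when called.
list_candidate_attributes <- function(pattern) {
  # Assumption: human proteins, queried through the Ensembl "genes" mart
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "hsapiens_gene_ensembl")
  find_attributes(biomaRt::listAttributes(mart), pattern)
}

# Example (run interactively):
# list_candidate_attributes("uniprot")
# list_candidate_attributes("protein_id")
```

If no GI-like attribute turns up, the mart probably cannot be filtered by GI directly, and the GIs would first have to be converted to accessions some other way.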


Thanks a lot, you pointed me in the right direction!!!

4.0 years ago
jose.wo ▴ 10

Thanks a lot manaswwm!!! You pointed me in the right direction. What I finally did was scrape the NCBI site to download the relevant information. I still have the problem that some records have been removed or are obsolete, but there is no way around that other than getting the information manually.

Here is the code I used to retrieve the relevant information, in case it helps someone.

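library(stringr)  # str_extract(), str_trim()
library(dplyr)    # only needed to call the function inside a dplyr pipe

# Download the GenPept page for one GI number and extract the gene symbol,
# the accession and the Gene ID. A field that is not found comes back as "NA".
extractInfo <- Vectorize(function(GInumber){
    url       <- paste0("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=", GInumber, "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000")
    tempPage  <- readLines(url, skipNul = TRUE, warn = FALSE)
    # Collapse the page into a single string so the regexes can span lines
    tempPage  <- paste(tempPage, collapse = "")
    # The accession sits between the ACCESSION and VERSION keywords
    Accession <- str_trim(str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)"))
    # Gene symbol from the /gene="..." qualifier
    Symbol    <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
    # Gene ID from the link to the NCBI Gene page
    GeneID    <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
    paste(Symbol, Accession, GeneID, sep = "---")
})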

You only need to pass the GI number (or a vector of them) to the "extractInfo" function and voilà! It returns one string per GI in the form Symbol---Accession---GeneID. I vectorized the function because I needed to use it in a dplyr pipe.
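In case it helps, the "---"-separated strings can be split back into columns afterwards. This is just a sketch in base R: `splitInfo` is a name made up here, and the example values in the comments are placeholders, not real records.

```r
# Split strings of the form "Symbol---Accession---GeneID" into a data frame.
# Hypothetical helper; fields that extractInfo could not find arrive as the
# literal string "NA" (because of paste()) and are turned into real NAs.
splitInfo <- function(x) {
  parts <- strsplit(x, "---", fixed = TRUE)
  # Pad every entry to exactly three fields, one row per input string
  mat <- t(vapply(parts, function(p) { length(p) <- 3; p }, character(3)))
  mat[!is.na(mat) & mat == "NA"] <- NA_character_
  data.frame(Symbol    = mat[, 1],
             Accession = mat[, 2],
             GeneID    = mat[, 3],
             stringsAsFactors = FALSE)
}

# Example with made-up placeholder strings:
# splitInfo(c("GENE1---ACC00001---1234", "NA---NA---NA"))

# In a dplyr pipe, assuming a data frame `proteins` with a column `GI`:
# proteins %>% mutate(info = extractInfo(GI))
# and then, outside the pipe: cbind(proteins, splitInfo(info_column))
```

GeneID is kept as character so that missing values and leading characters are not silently coerced.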

Thanks again, José.


Where possible, don't use GI numbers for analysis. They are deprecated for end users and should be replaced with accession numbers (accession.version identifiers).
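One way to do that conversion without scraping might be the rentrez package, which wraps the NCBI E-utilities. This is a rough sketch, assuming that the protein database still resolves your GI numbers and that `rettype = "acc"` returns one accession.version per line; `gi_to_accession` and `parse_acc` are names made up here.

```r
# Turn the plain-text response (one accession.version per line)
# into a character vector, dropping empty lines.
parse_acc <- function(txt) {
  acc <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  acc[nzchar(acc)]
}

# Needs the rentrez package and network access, so it only runs when called.
# Assumption: NCBI still accepts the GI as an `id` for this record.
gi_to_accession <- function(gi) {
  txt <- rentrez::entrez_fetch(db = "protein", id = gi, rettype = "acc")
  parse_acc(txt)
}

# Example (run interactively), using the GI from the question:
# gi_to_accession("29436380")
```

Removed or obsolete records may come back empty, so querying one GI at a time (or checking the lengths) is safer than assuming the output lines up with the input. The resulting accessions are what you would then feed into a UniProt ID mapping.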
