Map Protein Accession Number to Uniprot identifier in R
18 months ago
jose.wo ▴ 10

Hello,

I have a list of proteins identified by what the author of the list calls a "Protein Accession Number". When I search for one of these numbers in the NCBI Protein database, I do find the protein.

For example, the protein accession number 29436380 returns the "MYH9 protein", and the number also appears in the page address: https://www.ncbi.nlm.nih.gov/protein/29436380

I have looked all over but couldn't find out what this identifier really is. I would like to use my list in R with other Bioconductor packages, and for that I need to match it to UniProt IDs. I can't find any package that does this, mainly because I don't know what this identifier is called.

Can anyone help me by either pointing me to a package that can do this mapping or telling me the name of this "accession number"?

Best, José.

R protein accession number uniprot nbci • 919 views

It looks like a GI number (an identifier consisting only of digits). More on them here: https://www.ncbi.nlm.nih.gov/genbank/sequenceids/

I'm not entirely sure, but biomaRt might be useful to you. The documentation is here: https://www.bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html. You may need to play around with the package to find out whether biomaRt has an "attribute" for GIs (just as it has ensembl_gene_id and uniprotswissprot, for example). If it does, you can query the proteins using your list of GIs (I have a feeling that you might also need the sequence accession, but I might be wrong) and get metadata or sequence info.
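In case it helps, the "only numbers" rule of thumb can be checked in R before attempting any mapping. This is just a rough sketch: isLikelyGI is a made-up helper name, and the three strings are only examples of identifier formats.

```r
# Rule of thumb: a GI number consists of digits only.
# isLikelyGI is a hypothetical helper, not part of any package.
isLikelyGI <- function(ids) {
  grepl("^[0-9]+$", trimws(as.character(ids)))
}

# One digits-only identifier and two that contain letters
ids <- c("29436380", "P35579", "NP_002464.1")
isLikelyGI(ids)
```

Only the first element is flagged, so a mixed list can be split into GI numbers and everything else before deciding how to map each part.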


Thanks a lot, you pointed me in the right direction!!!

18 months ago
jose.wo ▴ 10

Thanks a lot, manaswwm! You pointed me in the right direction. What I finally did was scrape the NCBI site to download the relevant information. I still have the problem that some records were removed or are obsolete, but there is no way around that other than getting the information manually.

Here is the code I used to retrieve the relevant information, in case it helps someone.

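library(rvest)    # loaded in the original post; not used by extractInfo itself
library(stringr)
library(dplyr)

# Look up one GI number and return "Symbol---Accession---GeneID".
extractInfo <- Vectorize(function(GInumber){
  # ASSUMPTION: the download step is missing from the post as preserved here.
  # The sviewer URL below is a guess at the GenPept HTML view, not the author's code.
  tempPage  <- readLines(paste0("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=",
                                GInumber, "&db=protein&report=genpept&retmode=html&withmarkup=on"),
                         warn = FALSE)
  tempPage  <- base::paste(tempPage, collapse = "")
  # Pull each field out of the collapsed page text with lookaround regexes.
  Accession <- str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)")
  Symbol    <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
  GeneID    <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
  out       <- paste(Symbol, Accession, GeneID, sep = "---")
  return(out)
})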


You only need to pass the GI number to the extractInfo function and voilà! I vectorized the function because I needed to use it in a dplyr pipeline.
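Since extractInfo returns the three fields glued together with "---", something like the following could split them back into columns. It is only a sketch: splitInfo is a hypothetical helper, and the input strings are made-up placeholders rather than real NCBI records.

```r
# Hypothetical helper: turn "Symbol---Accession---GeneID" strings, as returned
# by extractInfo(), into a data frame with one column per field.
splitInfo <- function(info) {
  parts <- strsplit(unname(as.character(info)), "---", fixed = TRUE)
  field <- function(i) {
    x <- vapply(parts, function(p) trimws(p[i]), character(1))
    # paste() turns a failed regex match (NA) into the string "NA"; undo that
    x[x %in% "NA"] <- NA_character_
    x
  }
  data.frame(Symbol = field(1), Accession = field(2), GeneID = field(3),
             stringsAsFactors = FALSE)
}

# Placeholder input: one complete record and one where nothing was found
splitInfo(c("GENE1--- ABC12345 ---1234", "NA---NA---NA"))
```

Rows that come back as all NA are the ones to check by hand, for example records that were removed or are obsolete.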

Thanks again, José.


Where possible, don't use GI numbers for analysis. They are deprecated for end-user use and should be replaced with accession numbers.
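One way to move an existing list off GI numbers might be NCBI's E-utilities efetch endpoint, which can return just the accession for a record when called with rettype=acc. A minimal sketch, assuming the GI numbers still resolve in the protein database; giToAccessionUrl and giToAccession are hypothetical helper names, and the fetch needs network access.

```r
# Build an E-utilities efetch URL that asks only for the accession.version
# of a protein record. Helper names here are hypothetical.
giToAccessionUrl <- function(gi) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
         "?db=protein&id=", gi, "&rettype=acc&retmode=text")
}

# Fetch the accession for one GI number; returns NA if the request fails
# (for example because the record was removed).
giToAccession <- function(gi) {
  tryCatch(
    trimws(readLines(giToAccessionUrl(gi), warn = FALSE)[1]),
    error = function(e) NA_character_
  )
}

# Usage (not run here; requires network access):
# giToAccession("29436380")
```

When looping over a long list, add a short Sys.sleep() between calls, since NCBI limits unauthenticated clients to about three requests per second.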