Question: Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession
8 months ago, tom5 wrote:

Hi, I hope you are doing well. I ran a BLAST alignment on a multi-gene FASTA file and kept the top hit for each gene as a RefSeq protein ID (such as NP_001229937.1). I want to convert these protein IDs to Entrez Gene accessions or Ensembl IDs. Is there a way to do so programmatically? I am working in R. I tried BioMart, but it returned no matches for some of the input RefSeq protein IDs.

entrez R gene • 267 views
8 months ago, GenoMax (United States) wrote:

Using EntrezDirect:

$ esearch -db protein -query "NP_001229937" | elink -target nuccore | efetch -format acc
NM_001243008.1
NC_000067.6

OR

$ esearch -db protein -query "NP_001229937" | elink -target gene | efetch -format ft

1. Col6a3
Official Symbol: Col6a3 and Name: collagen, type VI, alpha 3 [Mus musculus (house mouse)]
Other Aliases: AI507288, Col6a-3
Other Designations: collagen alpha-3(VI) chain; collagen alpha 3 chain type VI; collagen alpha3(VI); procollagen, type VI, alpha 3; type VI collagen alpha 3 subunit
Chromosome: 1; Location: 1 45.53 cM
Annotation: Chromosome 1 NC_000067.6 (90766860..90844001, complement)
ID: 12835

Thank you! Is there a way to pass in a file with multiple RefSeq IDs at once, for example by giving a file name instead of -query? I just need the gene symbol (and not the other information) for each RefSeq ID. My final goal is a table of gene symbols corresponding to the input file of RefSeq IDs.

Reply written 8 months ago by tom5

Use something like this, where file.txt contains one accession per line:

cat file.txt | epost -db protein -format acc | elink -target nuccore | efetch -format acc

To get GeneNames:

$ esearch -db protein -query "NP_001229937" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name
Col6a3
Reply written 8 months ago by GenoMax

Hi, thank you for the quick reply! The second command you shared is exactly what I need: it returns the gene symbol. However, I am not sure how to pass a file of multiple accessions (one per line) to this command. Could you explain how to do that?

Reply written 8 months ago by tom5

cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name

file.txt should contain one accession per line.

Reply written 8 months ago by GenoMax
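
Editorial sketch, not part of the original replies: the batch command above prints gene symbols only, so the link back to each input accession is not visible in the output, and an accession without a gene link may simply be missing. Assuming EDirect is installed, one way to build a two-column table is to run the single-accession pipeline from this thread once per line. The function names lookup_symbol and map_accessions are hypothetical, and the `< /dev/null` redirect is a precaution so that esearch does not consume the loop's input.

```shell
# Look up the gene symbol(s) for one protein accession with EDirect.
# Same pipeline as in the reply above; only the wrapper is new.
lookup_symbol() {
  esearch -db protein -query "$1" < /dev/null |
    elink -target gene |
    esummary |
    xtract -pattern DocumentSummary -element Name
}

# Read accessions from stdin (one per line) and print
# "accession<TAB>symbol". Multiple symbols are joined with commas;
# accessions with no result get NA.
map_accessions() {
  while IFS= read -r acc; do
    [ -n "$acc" ] || continue          # skip blank lines
    sym=$(lookup_symbol "$acc" | paste -sd, -)
    printf '%s\t%s\n' "$acc" "${sym:-NA}"
  done
}

# Usage (hypothetical file names):
#   map_accessions < file.txt > symbols.tsv
```

This makes one set of requests per accession, so it is slower than the epost version and is probably best kept to modest input sizes.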
8 months ago, brianj.park (Montréal, Canada) wrote:

You can use org.Mm.eg.db.

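library(org.Mm.eg.db)
Mm <- org.Mm.eg.db
# RefSeq protein accession, given without the version suffix
my_symbol <- "NP_001229937"
# Map the RefSeq key to its Ensembl gene ID
select(Mm, keys = my_symbol, columns = c("REFSEQ", "ENSEMBL"), keytype = "REFSEQ")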

        REFSEQ            ENSEMBL
1 NP_001229937 ENSMUSG00000048126

Thanks! However, when I try the RefSeq ID "NP_033865.2", this returns "None of the keys entered are valid keys for 'REFSEQ'". I double-checked on NCBI and this is a valid gene entry. Please let me know if there's a way to resolve this issue. Thank you for your help!

Reply written 8 months ago by tom5
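
Editorial sketch, not part of the original replies: a likely cause, stated here as an assumption rather than something confirmed in this thread, is the ".2" version suffix. The working example in the answer uses the unversioned key NP_001229937, and the REFSEQ keys in org.Mm.eg.db appear to be stored without versions. Stripping the suffix before the lookup may resolve the error. The helper name strip_version is hypothetical.

```shell
# Remove a trailing ".<digits>" version suffix from each accession
# read on stdin, e.g. NP_033865.2 becomes NP_033865. Lines without a
# version suffix pass through unchanged.
strip_version() {
  sed 's/\.[0-9][0-9]*$//'
}

# Usage (hypothetical file names):
#   strip_version < file.txt > file_noversion.txt
```

In R, the same idea can be expressed with sub("\\.[0-9]+$", "", ids) on the vector of IDs before passing it to select().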