Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession
2.0 years ago
tom5 • 0

Hi, I hope you are doing well. I ran a BLAST alignment on a multi-gene FASTA file and retrieved the top hit for each gene as a RefSeq protein ID (such as NP_001229937.1). I want to convert these protein IDs to Entrez Gene accessions or Ensembl IDs. Is there a way to do so programmatically? I am working in R. I tried Biomart, but it returned no matches for some of the input RefSeq protein IDs.

gene R Entrez • 786 views
2.0 years ago
GenoMax 115k

Using EntrezDirect:

$ esearch -db protein -query "NP_001229937" | elink -target nuccore | efetch -format acc
NM_001243008.1
NC_000067.6

OR

$ esearch -db protein -query "NP_001229937" | elink -target gene | efetch -format ft

1. Col6a3
Official Symbol: Col6a3 and Name: collagen, type VI, alpha 3 [Mus musculus (house mouse)]
Other Aliases: AI507288, Col6a-3
Other Designations: collagen alpha-3(VI) chain; collagen alpha 3 chain type VI; collagen alpha3(VI); procollagen, type VI, alpha 3; type VI collagen alpha 3 subunit
Chromosome: 1; Location: 1 45.53 cM
Annotation: Chromosome 1 NC_000067.6 (90766860..90844001, complement)
ID: 12835
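
If only the numeric Entrez Gene ID is needed, the "ID:" line of a text report like the one above can be pulled out with awk. This is a minimal sketch, assuming the report keeps the "ID: <number>" layout shown; extract_gene_id is a hypothetical helper name, not part of EntrezDirect.

```shell
# Hypothetical helper: read a gene text report on stdin and print the
# second field of every line that starts with "ID: " (the Entrez Gene ID).
extract_gene_id() {
  awk '/^ID: / { print $2 }'
}

# Example using the last two lines of the report above.
printf 'Annotation: Chromosome 1 NC_000067.6 (90766860..90844001, complement)\nID: 12835\n' | extract_gene_id
```

In practice the efetch output would be piped straight into extract_gene_id instead of the printf line.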

Thank you! Is there a way to pass in a file with multiple RefSeq IDs at once, for example by replacing -query with something like -input "file name"? I just need the gene symbol (and not the other information) for each RefSeq ID. My final goal is a table of gene symbols corresponding to the input file of RefSeq IDs.


Use something like this (file.txt contains one accession per line):

cat file.txt | epost -db protein -format acc | elink -target nuccore | efetch -format acc

To get GeneNames:

$ esearch -db protein -query "NP_001229937" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name
Col6a3

Hi, thank you for the quick reply! The second command you shared is exactly what I need: it returns the gene symbol. However, I am not sure how to pass a file of multiple accessions (one per line) to this command. Could you explain how to do this?

cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name

file.txt should contain one accession per line.
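
The batch command above prints gene names only, and the output may not line up one-to-one with the input order, so it is hard to turn into a table. One possible workaround is to look up each accession separately and print it next to its symbol. This is a sketch under assumptions: lookup_symbol and accessions_to_table are hypothetical helper names, the lookup reuses the single-accession pipeline from earlier in this thread, and "NA" is an arbitrary placeholder for accessions with no result.

```shell
# Hypothetical helper: gene symbol for one RefSeq protein accession,
# using the esearch | elink | esummary | xtract pipeline from this thread.
# </dev/null keeps esearch from consuming the caller's stdin inside a loop.
lookup_symbol() {
  esearch -db protein -query "$1" </dev/null |
    elink -target gene |
    esummary |
    xtract -pattern DocumentSummary -element Name
}

# Read accessions (one per line) on stdin and print "accession<TAB>symbol".
# The optional first argument names the lookup function to call.
accessions_to_table() {
  att_lookup="${1:-lookup_symbol}"
  while IFS= read -r att_acc || [ -n "$att_acc" ]; do
    [ -z "$att_acc" ] && continue          # skip blank lines
    att_symbol="$("$att_lookup" "$att_acc" | head -n 1)"
    printf '%s\t%s\n' "$att_acc" "${att_symbol:-NA}"
  done
}

# Usage (needs EntrezDirect installed and network access):
#   accessions_to_table < file.txt > symbols.tsv
```

This issues one set of requests per accession, so it is slower than the epost version, but every output row keeps its input accession.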

2.0 years ago
brianj.park ▴ 50

You can use org.Mm.eg.db.

library(org.Mm.eg.db)
Mm <- org.Mm.eg.db
# RefSeq protein accession to look up
my_refseq <- "NP_001229937"
# Map the RefSeq key to its Ensembl gene ID
select(Mm, keys = my_refseq, columns = c("REFSEQ", "ENSEMBL"), keytype = "REFSEQ")

        REFSEQ            ENSEMBL
1 NP_001229937 ENSMUSG00000048126

Thanks! However, when I try the RefSeq ID "NP_033865.2", this returns "None of the keys entered are valid keys for 'REFSEQ'". I double-checked on NCBI, and this is a valid gene entry. Please let me know if there's a way to resolve this issue. Thank you for your help!
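
One likely cause, offered as an assumption and not a confirmed diagnosis: the REFSEQ keys in org.Mm.eg.db are stored without the version suffix, so "NP_033865.2" would need to be queried as "NP_033865" (the example that worked above also had no version). In R, sub("\\.[0-9]+$", "", x) removes the suffix; the same cleanup can be applied to the accession file beforehand. A minimal sketch, where strip_version is a hypothetical helper name:

```shell
# Hypothetical helper: remove a trailing ".<digits>" version suffix from
# each accession read on stdin; lines without a suffix pass through as-is.
strip_version() {
  sed 's/\.[0-9][0-9]*$//'
}

# Example: a versioned and an unversioned accession.
printf 'NP_033865.2\nNP_001229937\n' | strip_version

# Usage on a file with one accession per line:
#   strip_version < file.txt > file_noversion.txt
```

If the unversioned key still returns no match, the accession may simply be absent from the installed annotation package version.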
