Question

converting RefSeq Protein (NCBI) accession numbers to Gene Symbol

3.4 years ago

NicoN64 ▴ 30

Hello,

I have a file of RefSeq Protein accession numbers from the chicken genome, and I want to convert them to gene symbols. I tried bioDBnet db2db for the conversion. However, when I download the full results (~50000), the downloaded file contains only part of them (< 12000).

Is there another way to run bioDBnet, or another tool (R packages, scripts) that gives the same results?

I also tried BioMart but, strangely, I get 0 results. I selected the chicken genes dataset, then under Filters chose "Input external references ID list" with "RefSeq peptide ID" for my NP accessions. I also tried gProfiler, but for many RefSeq accessions I get no gene symbol.

Some of my accession numbers look like this: NP_001001127.1 NP_999839.1 XP_001231206.2 XP_430508.3 YP_009555261.1

Thanks for the suggestions.

gene database • 3.7k views

ADD COMMENT • link updated 3.4 years ago by GenoMax 141k • written 3.4 years ago by NicoN64 ▴ 30

I suspect that if you remove the trailing version suffix (e.g. .1) from the RefSeq IDs, you'll get some results from Ensembl BioMart.
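
Stripping the version can be done in one pass over the accession file. This is only a minimal sketch: the function name strip_version is made up for illustration, and it assumes one accession per line with the version as a trailing dot plus digits (.1, .2, .3, as in the examples in the question).

```shell
# Strip a trailing RefSeq version suffix (".1", ".2", ...) from each line.
# strip_version is a hypothetical helper name, not part of any tool.
strip_version() {
  sed -E 's/\.[0-9]+$//'
}

# Example with two accessions from the question
printf '%s\n' NP_001001127.1 XP_001231206.2 | strip_version
```

With a file, something like `strip_version < acc > acc_noversion` would write the unversioned IDs to a new file (both file names are placeholders).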

ADD REPLY • link 3.4 years ago by Mike Smith ★ 2.0k

Yes, you are right, removing the .1 makes it work. Unfortunately, I still have missing results using BioMart.

ADD REPLY • link 3.4 years ago by NicoN64 ▴ 30

If you're trying to map exclusively between NCBI identifiers, it's probably best to avoid Ensembl. An Ensembl ID will always be the reference in their datasets, so you end up doing more conversions behind the scenes with more potential for non-one-to-one mappings.

ADD REPLY • link 3.4 years ago by Mike Smith ★ 2.0k

score 3 · Accepted Answer · 2020-12-09

Using EntrezDirect.

$ more acc
NP_001001127.1
NP_999839.1
XP_001231206.2
XP_430508.3
YP_009555261.1

$ for i in $(cat acc); do printf '%s\t' "${i}"; esearch -db protein -query "${i}" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name; done
NP_001001127.1  EDNRB
NP_999839.1 PCDHGC3
XP_001231206.2  CD72L2
XP_430508.3 HEMGN
YP_009555261.1  ND1

OR

$ for i in $(cat acc); do printf '%s\t' "${i}"; esearch -db protein -query "${i}" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Id; done
NP_001001127.1  EDNRB   408082
NP_999839.1 PCDHGC3 408051
XP_001231206.2  CD72L2  768355
XP_430508.3 HEMGN   427378
YP_009555261.1  ND1 39105255
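
With tens of thousands of accessions it helps to check which ones came back without a gene symbol. The sketch below is not part of EDirect. It assumes the loop output was saved as a tab-separated file with one accession per line, and that an accession with no linked gene has an empty second column or is missing from the file. The function name unmapped and the accession XP_000000000.1 are placeholders for illustration.

```shell
# List input accessions with no gene symbol in a two-column TSV
# (accession <TAB> symbol). Function and file names are hypothetical.
unmapped() {
  # $1 = file of input accessions, $2 = TSV of results
  awk -F'\t' '
    FILENAME == ARGV[1] { if ($2 != "") seen[$1] = 1; next }  # results file
    !($1 in seen)                                             # input file
  ' "$2" "$1"
}

# Demo on throwaway files; XP_000000000.1 is a placeholder, not a real record.
acc_file=$(mktemp); res_file=$(mktemp)
printf '%s\n' NP_001001127.1 XP_000000000.1 NP_999839.1 > "$acc_file"
printf 'NP_001001127.1\tEDNRB\nXP_000000000.1\t\nNP_999839.1\tPCDHGC3\n' > "$res_file"
unmapped "$acc_file" "$res_file"
rm -f "$acc_file" "$res_file"
```

The accessions it prints can then be retried or looked up by hand.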

score 1 · Accepted Answer · 2020-12-09

3.4 years ago

vkkodali_ncbi ★ 3.7k

You can use NCBI Datasets for this. Specifically, you can use the command line tool as shown below:

$ datasets summary gene accession NP_001001127.1 NP_999839.1 XP_001231206.2 XP_430508.3 YP_009555261.1 \
  | jq -r '.genes[] | .gene | [.gene_id,.symbol,.chromosomes[],(try .transcripts[]).protein.accession_version,(try .proteins[]).accession_version] | @tsv'
39105255  ND1      MT  YP_009555261.1  
408051    PCDHGC3  13  NP_999839.1     
408082    EDNRB    1   NP_001001127.1  
427378    HEMGN    Z   XP_430508.3     XP_004949672.1
768355    CD72L2   Z   XP_001231206.2

Here, I am using jq to parse the JSON output from datasets and extract a table of gene_id and symbol, but you can use any tool that can parse JSON.
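
For a long list stored in a file, one option is to pass the accessions to the command in fixed-size batches with xargs. This is a rough sketch, not an official recipe: run_in_batches is a made-up helper, the batch size is arbitrary, and the demo uses echo as a dry run so that nothing is actually sent to NCBI.

```shell
# Run a command over a file of accessions, a fixed number per call.
# run_in_batches is a hypothetical helper name.
run_in_batches() {
  # $1 = file with one accession per line, $2 = batch size, rest = command
  file=$1; size=$2; shift 2
  xargs -n "$size" "$@" < "$file"
}

# Dry run: "echo" prints each command line instead of executing it.
acc_file=$(mktemp)
printf '%s\n' NP_001001127.1 NP_999839.1 XP_001231206.2 XP_430508.3 YP_009555261.1 > "$acc_file"
run_in_batches "$acc_file" 2 echo datasets summary gene accession
rm -f "$acc_file"
```

Dropping the echo would run the real command once per batch. Assuming each call returns JSON in the same shape as above, the combined output can be piped into the same jq filter, since jq reads consecutive JSON documents from one stream.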

ADD COMMENT • link 3.4 years ago by vkkodali_ncbi ★ 3.7k

What's the ordering of the results? For me your first input ID (NP_001001127.1) maps to the third line of the results (EDNRB) (NCBI link)

ADD REPLY • link 3.4 years ago by Mike Smith ★ 2.0k

Numbers in column 1 are NCBI gene ID records. Here is one for EDNRB.

ADD REPLY • link 3.4 years ago by GenoMax 141k

My question was really about the row ordering. Why does the 1st input value match the 3rd output, or perhaps more usefully, is there some way to also return the input values so you can at least see the pairing?

ADD REPLY • link 3.4 years ago by Mike Smith ★ 2.0k

Hi, thanks, that seems to work. But I have the same question: what determines the ordering of the results? Maybe there is a way to get a column with the submitted accession numbers? Also, is it possible to give it a list of accession numbers in a file.txt (as I have ~30000 accessions)? Or should I use a while read loop?

ADD REPLY • link 3.4 years ago by NicoN64 ▴ 30

I have now updated my answer to show the protein accessions as well. Note that this adds extra columns for genes that encode more than one isoform.
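
Because the number of accession columns varies per gene, pairing each submitted accession with its gene takes one more step. A possible sketch with awk is below. It assumes the table was saved as tab-separated text with gene_id, symbol and chromosome in columns 1 to 3 and protein accessions from column 4 onward. The function name pair_with_input is made up for illustration, and NA marks accessions that are not found in the table.

```shell
# Print "accession <TAB> symbol <TAB> gene_id" in the order of the input list.
# pair_with_input is a hypothetical helper name.
pair_with_input() {
  # $1 = input accessions (one per line), $2 = TSV table from datasets | jq
  awk -F'\t' -v OFS='\t' '
    FILENAME == ARGV[1] {            # table: index every accession column
      for (i = 4; i <= NF; i++)
        if ($i != "") { sym[$i] = $2; gid[$i] = $1 }
      next
    }
    { print $1, (($1 in sym) ? sym[$1] : "NA"), (($1 in gid) ? gid[$1] : "NA") }
  ' "$2" "$1"
}

# Demo using the accessions and table rows shown in this thread
acc_file=$(mktemp); tab_file=$(mktemp)
printf '%s\n' NP_001001127.1 NP_999839.1 XP_001231206.2 XP_430508.3 YP_009555261.1 > "$acc_file"
printf '39105255\tND1\tMT\tYP_009555261.1\n408051\tPCDHGC3\t13\tNP_999839.1\n408082\tEDNRB\t1\tNP_001001127.1\n427378\tHEMGN\tZ\tXP_430508.3\tXP_004949672.1\n768355\tCD72L2\tZ\tXP_001231206.2\n' > "$tab_file"
pair_with_input "$acc_file" "$tab_file"
rm -f "$acc_file" "$tab_file"
```

The output follows the order of the input file, so each submitted accession sits next to its gene symbol and gene ID.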

ADD REPLY • link 3.4 years ago by vkkodali_ncbi ★ 3.7k