Question

Bioinformatics: I have a list of Protein RefSEQ IDs returned by BLAST and I need to convert them to Entrez Gene IDs

0

5.2 years ago

tom5 • 0

Hi, I hope you're well. I have a list of (1000+) Protein RefSeq IDs returned by BLAST and I need to convert them to Entrez gene IDs. Is there a way to do so? Thank you for the help!

R BLAST entrez gene ID • 1.9k views

ADD COMMENT • link 5.2 years ago by tom5 • 0

0

Hello tom5!

You have already received answers to this in your last question:

Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

ADD REPLY • link 5.2 years ago by GenoMax 152k

0

Hi, I apologize for the similar question. However, my previous question dealt with converting protein refSeq IDs to Ensembl or Entrez gene accessions. I am now trying to convert from protein refSeq ID to Entrez gene ID. I know these are very similar tasks but I am not familiar enough with Entrez Direct to generalize the previous reply to this task.

ADD REPLY • link 5.2 years ago by tom5 • 0

0

The answer in "Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession" will work here as well. If it does not, can you post a couple of examples?

ADD REPLY • link 5.2 years ago by GenoMax 152k

0

Yes, as an example, I want to convert the refseq ID 'NP_001026105.1' to the corresponding entrez gene ID: 420087. Is there a way to do so? My file has 1000+ refseq IDs (one per line) and I want to convert them to corresponding gene IDs. I'm sorry if you have already explained it.

ADD REPLY • link 5.2 years ago by tom5 • 0

score 1 · Answer 1 · 2020-05-05

1

5.2 years ago

GenoMax 152k

This would work:

$ esearch -db protein -query "NP_001026105" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id
420087

For a lot of them you would do:

$ cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id

You should stay away from the gi numbers since they are now deprecated for end-user use.
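BLAST output often repeats the same hit and carries a version suffix (for example 'NP_001026105.1'), while the single query above uses the unversioned accession. If you want the list in that same form before passing it to epost, a minimal sketch follows. The function name clean_accessions and the file names are made up for illustration, and it assumes one accession per line.

```shell
# Hypothetical helper: normalize a one-per-line accession list.
# $1 = file with one RefSeq protein accession per line
clean_accessions() {
    # Remove carriage returns, trim whitespace, strip a trailing ".<version>",
    # drop blank lines, then sort and remove duplicates.
    tr -d '\r' < "$1" |
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e 's/\.[0-9][0-9]*$//' |
        awk 'NF' |
        sort -u
}

# Usage (file names are placeholders):
# clean_accessions file.txt > accessions.txt
```

Note that the output is sorted, so the original order of the file is not preserved.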

ADD COMMENT • link 5.2 years ago by GenoMax 152k

0

Thank you for the help! I have another question for this thread. Is there a way to get the gene description (the functional role of the gene) for each entry? For example, for RefSeq ID "NP_001026015.1", which corresponds to gene symbol "AAR2", the description returned would be "AAR2 splicing factor homolog [Source:NCBI gene;Acc:419118]".

I am also worried that this command may take too long to run and that the terminal may time out with too many entries (currently ~1500). Is this a legitimate concern?

ADD REPLY • link 5.2 years ago by tom5 • 0

1

$ esearch -db protein -query "NP_001026015" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description
AAR2    AAR2 splicing factor homolog

Sign up for an NCBI API key and use it as described here. That should allow you to get through your whole list.
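If a single large request still worries you, one option is to send the list in smaller pieces. The following is only a sketch: the helper names run_in_chunks and lookup_genes, the CHUNK_PAUSE variable, and the chunk size of 200 are my own choices for illustration, not anything EDirect or NCBI prescribes.

```shell
# Hypothetical helper: feed an accession list to a command in fixed-size chunks.
# $1 = accession file, $2 = lines per chunk, remaining arguments = command to run
run_in_chunks() {
    chunk_input=$1
    chunk_size=$2
    shift 2
    chunk_dir=$(mktemp -d) || return 1
    split -l "$chunk_size" "$chunk_input" "$chunk_dir/chunk_"
    for chunk in "$chunk_dir"/chunk_*; do
        [ -e "$chunk" ] || continue      # nothing to do if the input was empty
        "$@" < "$chunk"                  # the command reads one chunk on stdin
        sleep "${CHUNK_PAUSE:-1}"        # short pause between requests
    done
    rm -rf "$chunk_dir"
}

# The pipeline from this thread, wrapped so it reads accessions on stdin.
lookup_genes() {
    epost -db protein -format acc |
        elink -target gene |
        esummary |
        xtract -pattern DocumentSummary -element Id,Name,Description
}

# Usage (file names and the chunk size are placeholders):
# run_in_chunks file.txt 200 lookup_genes > genes.tsv
```

Because each chunk is independent, a failed chunk can be rerun on its own instead of repeating the whole list.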

ADD REPLY • link 5.2 years ago by GenoMax 152k

0

Thanks! How would I run this for a file with multiple RefSeq IDs? I expect it will be similar to your reply above using epost, but I am not certain how to implement it.

I just created the API key. How do I use it with these calls? The documentation is a little confusing.

ADD REPLY • link 5.2 years ago by tom5 • 0

1

You can export the key as an environment variable in the terminal where you run these searches: export NCBI_API_KEY='your_key_string'. You can also add that line to your shell initialization file so the variable is exported every time you log in.

$ cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description

ADD REPLY • link 5.2 years ago by GenoMax 152k

0

Hi, thank you for this helpful post! Is it possible to also output the original refseq ID so I can tie the results back to the input data? I will need to do some comparisons with this data.

ADD REPLY • link 5.2 years ago by tom5 • 0

2

for acc in $(cat accession_file); do printf '%s\t%s\n' "$acc" "$(esearch -db protein -query "$acc" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description)"; done

NP_001026015    AAR2    AAR2 splicing factor homolog
NP_001026025    TTPAL   alpha tocopherol transfer protein like
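For the comparisons, it may help to check which input accessions came back without a gene. Here is a small sketch, assuming you redirected the loop output to a tab-separated file (results.tsv is a placeholder name) with the accession in the first column. The function name missing_genes is made up for illustration.

```shell
# Hypothetical helper: list input accessions that have no gene in the results.
# $1 = accession list (one per line)
# $2 = tab-separated results: accession, gene symbol, description
missing_genes() {
    awk -F '\t' '
        # Results file: remember accessions that have a non-empty gene symbol.
        FILENAME == ARGV[1] { if ($2 != "") found[$1] = 1; next }
        # Accession list: print non-blank entries that were never matched.
        $1 != "" && !($1 in found)
    ' "$2" "$1"
}

# Usage (file names are placeholders):
# missing_genes accession_file results.tsv > unmatched.txt
```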

ADD REPLY • link 5.2 years ago by GenoMax 152k

0

Thanks! This is really helpful

ADD REPLY • link 5.2 years ago by tom5 • 0

0

You could accept the answer (green check mark) to provide closure to this thread then.

ADD REPLY • link 5.2 years ago by GenoMax 152k