Question: Bioinformatics: I have a list of Protein RefSeq IDs returned by BLAST and I need to convert them to Entrez Gene IDs
tom5 wrote:

Hi, I hope you're well. I have a list of (1000+) Protein RefSeq IDs returned by BLAST and I need to convert them to Entrez gene IDs. Is there a way to do so? Thank you for the help!

blast entrez gene id R • 237 views
written 3 months ago by tom5

Hello tom5!

You have already received answers for this in your last question:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

modified 3 months ago • written 3 months ago by genomax

Hi, I apologize for the similar question. However, my previous question dealt with converting protein RefSeq IDs to Ensembl or Entrez gene accessions. I am now trying to convert protein RefSeq IDs to Entrez gene IDs. I know these are very similar tasks, but I am not familiar enough with Entrez Direct to generalize the previous reply to this task.

written 3 months ago by tom5

The answer in "C: Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession" should work. If it does not, can you post a couple of examples?

written 3 months ago by genomax

Yes, as an example, I want to convert the RefSeq ID 'NP_001026105.1' to the corresponding Entrez gene ID: 420087. Is there a way to do so? My file has 1000+ RefSeq IDs (one per line) and I want to convert them to the corresponding gene IDs. I'm sorry if you have already explained it.

written 3 months ago by tom5
genomax wrote:

This would work:

$ esearch -db protein -query "NP_001026105" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id
420087

For a lot of them you would do:

$ cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id

You should stay away from GI numbers, since they are now deprecated for end-user use.
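If a single pass over a long list turns out to be slow or fails partway through, one option is to process the accessions in smaller batches. The sketch below is only an illustration under assumptions: `run_in_batches` and `to_gene_ids` are hypothetical helper names, the batch size is arbitrary, and `to_gene_ids` simply wraps the pipeline shown above.

```shell
# Hypothetical helper: split an accession file into batches of SIZE lines
# and run the given command once per batch, feeding the batch on stdin.
run_in_batches() {
  local input size tmpdir f
  input=$1
  size=$2
  shift 2
  tmpdir=$(mktemp -d)
  # split writes batch_aa, batch_ab, ... into the temporary directory
  split -l "$size" "$input" "$tmpdir/batch_"
  for f in "$tmpdir"/batch_*; do
    "$@" < "$f"
  done
  rm -rf "$tmpdir"
}

# Hypothetical wrapper around the pipeline from this answer (needs EDirect).
to_gene_ids() {
  epost -db protein -format acc | elink -target gene | esummary |
    xtract -pattern DocumentSummary -element Id
}

# Example call (batch size of 200 is an assumption, not a recommendation):
#   run_in_batches file.txt 200 to_gene_ids > gene_ids.txt
```

Smaller batches also mean that a failure only costs one batch, since the remaining batches are still processed.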

modified 3 months ago • written 3 months ago by genomax

Thank you for the help! Another question in this thread: is there a way to get the gene description (the functional role of each gene) for each entry? For example, for RefSeq ID "NP_001026015.1", corresponding to gene symbol "AAR2", the description returned would be "AAR2 splicing factor homolog [Source:NCBI gene;Acc:419118]".

I am also worried that this command may take too long to run and that the terminal may time out with this many entries (currently around 1500). Is this a legitimate concern?

written 3 months ago by tom5
$ esearch -db protein -query "NP_001026015" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description
AAR2    AAR2 splicing factor homolog

Sign up for and use an NCBI API key as described here. That should allow you to get through your whole list.

modified 3 months ago • written 3 months ago by genomax

Thanks! How would I run this for a file with multiple RefSeq IDs? I think it will be similar to your reply above using epost, but I am not certain how to implement it.

I just created the API key. However, how do I use the key with these calls? The documentation is a little confusing.

written 3 months ago by tom5

You can export the key as an environment variable in the terminal where you run these searches: export NCBI_API_KEY='your_key_string'. You can also add this line to your shell initialization file so that it is exported whenever you log in.

$ cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description
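Before starting a long run, it can help to confirm that the key is actually exported in the current shell. The following is a minimal sketch; `check_ncbi_key` is a hypothetical helper name, and it only checks that the variable is non-empty, not that NCBI accepts the key.

```shell
# Hypothetical guard: report whether NCBI_API_KEY is set in this shell.
# Returns 1 (and warns on stderr) when the variable is unset or empty.
check_ncbi_key() {
  if [ -z "${NCBI_API_KEY:-}" ]; then
    echo "NCBI_API_KEY is not set; requests will run without an API key" >&2
    return 1
  fi
  # Print only the length so the key itself never ends up in logs.
  echo "NCBI_API_KEY is set (${#NCBI_API_KEY} characters)"
}

# Example: only start the batch job when the key is present.
#   check_ncbi_key && cat file.txt | epost -db protein -format acc | ...
```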
modified 3 months ago • written 3 months ago by genomax

Hi, thank you for this helpful post! Is it possible to also output the original RefSeq ID so I can tie the results back to the input data? I will need to do some comparisons with this data.

written 3 months ago by tom5
for line in $(cat accession_file); do res=$(esearch -db protein -query "$line" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description); printf '%s\t%s\n' "$line" "$res"; done

NP_001026015    AAR2    AAR2 splicing factor homolog
NP_001026025    TTPAL   alpha tocopherol transfer protein like
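To tie the results back to the input, it may also be useful to list accessions that came back without a gene. The sketch below assumes the loop output was saved to a tab-delimited file in which each row starts with the accession, and that accessions are written identically in both files. `missing_accessions` and the file names are hypothetical.

```shell
# Hypothetical helper: print accessions from the input list that have no
# usable row in the results table (no row at all, or an empty second field).
missing_accessions() {
  local accessions results
  accessions=$1
  results=$2
  awk -F'\t' -v res="$results" '
    BEGIN {
      # Load the results table; remember accessions that have a gene name.
      while ((getline row < res) > 0) {
        split(row, field, "\t")
        if (field[2] != "") seen[field[1]] = 1
      }
    }
    # Print every non-blank input accession that was not seen.
    $1 != "" && !($1 in seen)
  ' "$accessions"
}

# Example (file names are assumptions):
#   missing_accessions accession_file results.tsv > no_gene_found.txt
```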
modified 3 months ago • written 3 months ago by genomax

Thanks! This is really helpful.

written 3 months ago by tom5

You could then accept the answer (green check mark) to provide closure to this thread.

written 3 months ago by genomax