Question: how to convert NCBI protein ID to corresponding nucleotide sequence
4.4 years ago, natasha.sernova wrote:

Dear ALL,

I have a set of NCBI protein IDs. I know how to convert them to protein sequences using the E-utilities:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $prot_id = "WP_005451061.1";

# Fetch the protein record in FASTA format with EFetch
my $response = $ua->get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=' . $prot_id);

# Stop with the HTTP status line if the request failed
unless ($response->is_success) {
     die $response->status_line;
}

my $content = $response->decoded_content();

# Choose the output layer that matches the decoded content
if (utf8::is_utf8($content)) {
     binmode STDOUT, ':utf8';
} else {
     binmode STDOUT, ':raw';
}

print $content;

I use only one protein as an example; I know E-utilities support batch requests, so that part is not critical.

But I would like to find a way to convert any NCBI protein ID to its original nucleotide source, mRNA or otherwise. I work with bacteria, so introns and the like are not a problem. I saw what looks like a suitable tool in E-utilities, but I could not get as far as the nucleotide sequence: I realized that the protein ID will change. BioMart has not helped me so far.

I probably have to use something like this in E-utilities (IDs are optional):

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902&cmd=neighbor_history

It just gives me an XML file, but how do I get from that to nucleotide sequences?

Could you, please, help me? Many thanks!

Natalia

4.4 years ago, David W (New Zealand) wrote:

You can use the ELink E-utility to find linked records (there will be an "IdList" in the resulting XML), but note:

(a) You have to use the GI number (not the accession you have above) for ELink.

(b) There may be multiple nucleotide records linked to a protein, and they may be much larger than the particular protein sequence (both true in this case).

Here's how the workflow might look in the R package rentrez; you can no doubt adapt the following to Perl or Your Favourite Scripting Language:

(search <- entrez_search(db="protein", term="WP_005451061[Accn]"))
#Entrez search result with 1 hits (object contains 1 IDs and no cookie)

(links <- entrez_link(dbfrom="protein", db="nuccore", id=search$ids))
# elink result with ids from 3 databases:
# [1] protein_nuccore     protein_nuccore_wgs protein_nuccore_wp
length(links$protein_nuccore)
# [1] 5

rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$protein_nuccore[1])
nchar(rec)
# [1] 77201
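
For readers who would rather not use R, the same three steps (ESearch to turn the accession into an ID, ELink to find linked nuccore records, EFetch to download one of them) can be sketched in Python with only the standard library. This is a rough adaptation, not part of the original answer: the helper names (eutils_url, esearch_ids, elink_ids, protein_to_nucleotide_fasta) are made up for illustration, the HTTPS base URL is an assumption, and the XML layout assumed is the usual IdList/Id and LinkSetDb/Link/Id structure.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed base URL for the E-utilities (HTTPS form of the URLs used in this thread).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Build the URL for one E-utility, e.g. tool='esearch'."""
    return EUTILS + tool + ".fcgi?" + urllib.parse.urlencode(params)


def fetch(url):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def esearch_ids(xml_text):
    """Return the IDs listed in the IdList of an ESearch result."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall("./IdList/Id")]


def elink_ids(xml_text, linkname="protein_nuccore"):
    """Return the linked IDs for one link name in an ELink result."""
    root = ET.fromstring(xml_text)
    ids = []
    for link_set_db in root.iter("LinkSetDb"):
        if link_set_db.findtext("LinkName") == linkname:
            ids.extend(e.text for e in link_set_db.findall("./Link/Id"))
    return ids


def protein_to_nucleotide_fasta(accession):
    """Accession -> protein ID -> first linked nuccore record, as FASTA text.

    Returns an empty string if the search or the link step finds nothing.
    """
    search_xml = fetch(eutils_url("esearch", db="protein",
                                  term=accession + "[Accn]"))
    protein_ids = esearch_ids(search_xml)
    if not protein_ids:
        return ""
    link_xml = fetch(eutils_url("elink", dbfrom="protein", db="nuccore",
                                id=protein_ids[0]))
    nuc_ids = elink_ids(link_xml)
    if not nuc_ids:
        return ""
    return fetch(eutils_url("efetch", db="nuccore", id=nuc_ids[0],
                            rettype="fasta", retmode="text"))
```

Calling protein_to_nucleotide_fasta("WP_005451061") would make three network requests. As in the R example, it only takes the first linked record, which may be a whole contig rather than the gene itself, so the other link names and IDs are worth inspecting before trusting the result.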

Thank you very much, David! I will try.

written 4.4 years ago by natasha.sernova

Dear David,

Reading http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

I’ve found only the following:

Sequences

Fetch FASTA for a transcript and its protein product (GIs 312836839 and 34577063)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=sequences&id=312836839,34577063&rettype=fasta&retmode=text

The link really leads to the seqs from NCBI.

http://www.ncbi.nlm.nih.gov/nuccore/312836839

http://www.ncbi.nlm.nih.gov/protein/34577063

But is there any way to isolate the corresponding “pure” nucleotide sequence from its start to its end, i.e. only the CDS for this protein? I think not. Is that correct?

And nucleotide IDs are quite different from protein IDs.

Is it possible to find the nucleotide ID, having only the protein ID or GI-number?

Please, give me a hint. Sorry, I have not tried your R-script yet, maybe, it will give me a solution.

Many thanks!

Sincerely yours,

Natalia

modified 4.4 years ago • written 4.4 years ago by natasha.sernova

Hi Natalia, 

But is there any way to isolate the corresponding “pure” nucleotide sequence from its start to its end, i.e. only the CDS for this protein? I think not. Is that correct?

As far as I know, this is correct. But check the "features" of the nucleotide record (in GenBank format), which may give the coordinates of your gene of interest.
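
Once the coordinates of the CDS have been read from the features of the nucleotide record, cutting out the coding sequence is a matter of slicing, plus a reverse complement for genes on the minus strand. A minimal Python sketch follows, assuming the coordinates are 1-based and inclusive as in GenBank feature locations, and that the feature is a single interval (no join(...) locations). The function names extract_cds and reverse_complement are hypothetical, and only A, C, G and T are complemented; other symbols such as N pass through unchanged.

```python
# Translation table for complementing plain A/C/G/T bases (upper or lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(nuc_seq, start, end, strand="+"):
    """Cut positions start..end (1-based, inclusive) out of nuc_seq.

    For strand '-', the slice is reverse-complemented so the result
    reads in the coding direction.
    """
    if not (1 <= start <= end <= len(nuc_seq)):
        raise ValueError("coordinates fall outside the sequence")
    sub = nuc_seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub
```

For example, extract_cds("AAATGCCCTAAGG", 3, 11) gives "ATGCCCTAA", and extract_cds("CCTTAGGGCATCC", 3, 11, "-") gives the same string because the gene lies on the opposite strand.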

Is it possible to find the nucleotide ID, having only the protein ID or GI-number?

You can get linked nucleotide IDs from protein IDs (but not accessions) with ELink. You can get protein IDs from protein accessions with ESearch (using the query in the code above).

written 4.3 years ago by David W

Thank you, David, I hope it will help.

 

written 4.3 years ago by natasha.sernova

Hi David, I tried your code, but at the last step it always gives me "protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109", even when I input different protein accession numbers. Does this mean that my input is wrong? Thank you in advance! Bing

written 9 months ago by bison

I also want to get the corresponding nucleotide sequence for each protein sequence from NCBI, because UniProt doesn't provide this service now.

written 9 months ago by bison