Question: Find an existing nucleotide sequence for a specific protein sequence through NCBI eutils
4.6 years ago by
Jenez520 wrote:


I am trying to find the nucleotide sequences that correspond to a list of protein sequences I obtained from a blastp search. So far I have tried to find them using NCBI's E-utilities.

I ran ESearch against the protein database with the protein IDs I have, which returns a result object that can then be passed to ELink. With ELink, I targeted the nuccore and gene databases to find links between each protein sequence and its nucleotide sequence. This works to some degree, but there seems to be no single robust way to find the corresponding nucleotide sequence for a given protein sequence. It works for the most part, but sometimes, for example when the protein ID corresponds to a multispecies entry, there is no gene link. If you instead ELink to nuccore, the XML output sometimes lacks information that is essential for picking out the right sequence.
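
For readers following along, the ELink step described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual script: the helper names (`elink_url`, `parse_elink`, `fetch`) are made up here, and it assumes ELink accepts the protein identifiers you pass and returns one `LinkSet` per repeated `id=` parameter. An empty list in the result is the "no link" case the question complains about.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(protein_ids, db="nuccore"):
    """Build an ELink URL from protein to `db` (hypothetical helper).

    Each ID is sent as its own `id=` parameter so that the response keeps
    one LinkSet per input ID instead of merging all links together.
    """
    params = [("dbfrom", "protein"), ("db", db)]
    params += [("id", pid) for pid in protein_ids]
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def parse_elink(xml_text):
    """Map each source ID to the list of linked IDs found in the XML."""
    root = ET.fromstring(xml_text)
    links = {}
    for linkset in root.iter("LinkSet"):
        source = linkset.findtext("IdList/Id")
        # No <Link> elements means ELink found nothing for this protein.
        links[source] = [link.findtext("Id") for link in linkset.iter("Link")]
    return links


def fetch(url):
    """Download a URL and return the body as text (performs a network call)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

Usage would look like `parse_elink(fetch(elink_url(ids)))`; any protein that maps to an empty list then needs a fallback route.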

Is there a robust method that always works? Every avenue turns out to be more complicated than it should be.

modified 4.6 years ago • written 4.6 years ago by Jenez520

Did you ever find a way to do this? I am trying to find the longest transcript for each gene. So far I can pull out all of the proteins for a gene and find the longest one, and I then want the coding sequence for that protein. I've asked my question here and here, but so far no luck! I cannot do it manually, as there are too many sequences to get.

written 4.0 years ago by Tom30

What I ended up doing was to query E-utilities for a so-called Identical Protein Groups (IPG) report.

From here, you often (not always) find a link from protein to nucleotide sequence. I've set it up so that I can query NCBI for the IPG report in XML format and parse out the information that I want (namely the nucleotide accession and coordinates, as well as the strand orientation). Using the nucleotide accession, I make another query to retrieve its XML report, and using the coordinates and strand information I retrieved earlier I can parse out the nucleotide sequence I want.
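
A rough sketch of that two-step IPG lookup is below. It deliberately differs from the setup described above in two ways: it reads the tab-delimited form of the report rather than XML, and it lets EFetch cut out the region through its `seq_start`, `seq_stop` and `strand` parameters instead of slicing a full record locally. The helper names are hypothetical, and the column headers (`Nucleotide Accession`, `Start`, `Stop`, `Strand`) are an assumption that should be checked against a real report. A plain start/stop slice also assumes a contiguous coding region, so it will not handle spliced genes.

```python
import csv
import io
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def ipg_url(protein_acc):
    """URL for the IPG report of one protein (text output is assumed)."""
    params = {"db": "protein", "id": protein_acc,
              "rettype": "ipg", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def parse_ipg(text):
    """Return the rows of a tab-delimited IPG report that carry coordinates."""
    hits = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        acc = (row.get("Nucleotide Accession") or "").strip()
        start = (row.get("Start") or "").strip()
        stop = (row.get("Stop") or "").strip()
        # Skip rows with no usable nucleotide record or coordinates.
        if not acc or acc == "N/A" or not (start.isdigit() and stop.isdigit()):
            continue
        hits.append({"nucleotide": acc, "start": int(start), "stop": int(stop),
                     "strand": (row.get("Strand") or "").strip()})
    return hits


def cds_url(hit):
    """URL that asks EFetch for just the coding region as FASTA."""
    params = {"db": "nuccore", "id": hit["nucleotide"],
              "rettype": "fasta", "retmode": "text",
              "seq_start": hit["start"], "seq_stop": hit["stop"],
              # EFetch encodes strand as 1 (plus) or 2 (minus).
              "strand": 1 if hit["strand"] == "+" else 2}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

Fetching `ipg_url(acc)`, passing the body to `parse_ipg`, and then fetching `cds_url(hit)` for the chosen row would cover both queries; choosing which row to trust when a group lists several nucleotide records is still left to the caller.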

It's not beautiful by any means: it often crashes and requires manual fixing when the IPG reports look weird, and there are probably a billion ways of doing it better.

Also, if you have many sequences to retrieve, this might not be the best idea, as it might take a while and would put quite a heavy load on the NCBI servers.
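
One way to soften the server-load concern is to send IDs in batches and pace the requests. The sketch below assumes NCBI's published guideline of at most 3 requests per second without an API key (hence the default interval of about a third of a second); `chunks` and `run_throttled` are hypothetical helper names, and `request` stands for whatever function performs a single E-utilities call.

```python
import time


def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_throttled(batches, request, min_interval=0.34,
                  sleep=time.sleep, clock=time.monotonic):
    """Call `request(batch)` for each batch, spacing calls by `min_interval`.

    `sleep` and `clock` are injectable so the pacing can be tested
    without actually waiting.
    """
    results = []
    last_call = None
    for batch in batches:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(request(batch))
    return results
```

For example, `run_throttled(chunks(protein_ids, 100), request)` would make one call per hundred IDs; the batch size of 100 is only an illustration.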

written 3.9 years ago by Jenez520