Question: Python Programmatically obtaining the corresponding wildtype DNA sequence of a RefSeq Protein Identifier (e.g WP_016693432.1)
gravatar for ahir29
2.3 years ago by
ahir2910 wrote:

Dear All, The question is quite self explanatory - what is the easiest way (using python) to get the corresponding DNA sequence from which the translated protein sequence is available in the RefSeq NCBI database.

For example, I have a fasta file after a blast search -

>WP_010594839.1 MULTISPECIES: NAD(P)-dependent alcohol dehydrogenase [Rhodococcus]
>WP_024100401.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus pyridinivorans]
>WP_016693432.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus rhodochrous]

I am using Biopython currently to try and map the refseq ID (e.g WP_010594839.1) and the Entrez.fetch to somehow obtain the corresponding DNA sequence. But, I am finding there is no clear way even non-programmatically to go and do something like this. The closest I can get is getting the whole genome of the file (by clicking the "nucleotide" or "genome" links) on the right side of the webpage of WP_010594839.1 (i.e. - (something I wouldnt know know how to do programmatically even if the whole genome was what I am after).

Can anyone please give me some advice on whether there is some super obvious way that I have totally missed, to get the corresponding coding DNA sequence of a protein given its reference number? Programmatically is preferred, but even manually is a good start for now!

Thank you

ADD COMMENTlink modified 24 months ago by Biostar ♦♦ 20 • written 2.3 years ago by ahir2910
gravatar for genomax
2.3 years ago by
United States
genomax92k wrote:

By representing identical proteins using a single non-redundant protein accession number (with the prefix WP_), redundancy in the database is significantly reduced.

So WP numbers actually represent multiple proteins though only one is indicated in the header. More information here.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by genomax92k

Thanks genomax - I cant believe my vigilant googling didnt get me to that page explaining where all the transcripts of a reference seq is kept! Need to improve my searching skills:) I am assuming now using the Entrez etuils wrapper in Biopython can get me the necessary transcripts. I will take a look at doing it myself!

ADD REPLYlink written 2.3 years ago by ahir2910

That may not be directly possible since WP accessions are not linked on backend with a single DNA sequence. I did try to see if efetch worked briefly (it did not).

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax92k

Is there no way to use the Entrez utils wrapper in biopython to access the "identical proteins" link ? I.e for ; this would lead to "" ; where any of the entries there gets me to a page which has a DNA transcript which I am happy to use. How do I get to that particular entry without web scraping or using any other tool that python/biopython?

ADD REPLYlink written 2.3 years ago by ahir2910

A lot of times I find eUtils to be in "oh so close but no cigar" category (Disc: I may be doing something wrong as well).

Something odd seems to be happening with querying ipg database and the retrieval format. You can test this and let me know if you see the same thing.

$ esearch -db ipg -query "WP_010594839.1" | efetch -format fasta
>CAA39212.1 extensin (class I) [Solanum lycopersicum]

$ esearch -db ipg -query "WP_010594839.1" | efetch -format docsum">

<DocumentSummarySet status="OK">

        <Title>NAD(P)-dependent alcohol dehydrogenase</Title>
        <ModificationDate>2018/05/08 00:00</ModificationDate>
        <BlastName>high GC Gram+</BlastName>

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax92k

Yup, i see the same thing. You are right, I am find the etuils hit and miss. I am resorting to webscraping the ipg page now. No other choice.

ADD REPLYlink written 2.3 years ago by ahir2910
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1520 users visited in the last hour