Question: Python Programmatically obtaining the corresponding wildtype DNA sequence of a RefSeq Protein Identifier (e.g WP_016693432.1)
1
gravatar for ahir29
4 months ago by
ahir2910
ahir2910 wrote:

Dear All, The question is quite self explanatory - what is the easiest way (using python) to get the corresponding DNA sequence from which the translated protein sequence is available in the RefSeq NCBI database.

For example, I have a fasta file after a blast search -

>WP_010594839.1 MULTISPECIES: NAD(P)-dependent alcohol dehydrogenase [Rhodococcus]
MKALQYTEIGSEPVVVDVPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYIYGLPLTLGHEGVGTVAELGAGVTGFET 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH 
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TAVFDFVGAQSTIDTAQQVVAIDGHISVVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMDVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_024100401.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus pyridinivorans]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTVAELGDGVTGFET 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH 
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TVVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_016693432.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus rhodochrous]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTIAELGAGVTGFEK 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAAPLTDAGLTPYH 
AISRVLPLLGPGSMAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TAVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG

I am using Biopython currently to try and map the refseq ID (e.g WP_010594839.1) and the Entrez.fetch to somehow obtain the corresponding DNA sequence. But, I am finding there is no clear way even non-programmatically to go and do something like this. The closest I can get is getting the whole genome of the file (by clicking the "nucleotide" or "genome" links) on the right side of the webpage of WP_010594839.1 (i.e. https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1) - (something I wouldnt know know how to do programmatically even if the whole genome was what I am after).

Can anyone please give me some advice on whether there is some super obvious way that I have totally missed, to get the corresponding coding DNA sequence of a protein given its reference number? Programmatically is preferred, but even manually is a good start for now!

Thank you

ADD COMMENTlink modified 4 days ago by Biostar ♦♦ 20 • written 4 months ago by ahir2910
2
gravatar for genomax
4 months ago by
genomax59k
United States
genomax59k wrote:

By representing identical proteins using a single non-redundant protein accession number (with the prefix WP_), redundancy in the database is significantly reduced.

So WP numbers actually represent multiple proteins though only one is indicated in the header. More information here.

ADD COMMENTlink modified 4 months ago • written 4 months ago by genomax59k

Thanks genomax - I cant believe my vigilant googling didnt get me to that page explaining where all the transcripts of a reference seq is kept! Need to improve my searching skills:) I am assuming now using the Entrez etuils wrapper in Biopython can get me the necessary transcripts. I will take a look at doing it myself!

ADD REPLYlink written 4 months ago by ahir2910

That may not be directly possible since WP accessions are not linked on backend with a single DNA sequence. I did try to see if efetch worked briefly (it did not).

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax59k

Is there no way to use the Entrez utils wrapper in biopython to access the "identical proteins" link ? I.e for https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1 ; this would lead to "https://www.ncbi.nlm.nih.gov/ipg/?term=WP_010594839.1" ; where any of the entries there gets me to a page which has a DNA transcript which I am happy to use. How do I get to that particular entry without web scraping or using any other tool that python/biopython?

ADD REPLYlink written 4 months ago by ahir2910

A lot of times I find eUtils to be in "oh so close but no cigar" category (Disc: I may be doing something wrong as well).

Something odd seems to be happening with querying ipg database and the retrieval format. You can test this and let me know if you see the same thing.

$ esearch -db ipg -query "WP_010594839.1" | efetch -format fasta
>CAA39212.1 extensin (class I) [Solanum lycopersicum]
PPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPDPSP
PPPYYYKSPPPPSPSPPPPSPSPPPPTYSSPPPPPPFYENIPLPPVIGVSYASPPPPVIPYY

$ esearch -db ipg -query "WP_010594839.1" | efetch -format docsum

https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20171128/esummary_ipg.dtd">

<DocumentSummarySet status="OK">
<DbBuild>Build180808-2026m.1</DbBuild>

<DocumentSummary><Id>1345531</Id>
        <Caption>WP_010594839</Caption>
        <Title>NAD(P)-dependent alcohol dehydrogenase</Title>
        <IPG>1345531</IPG>
        <TaxId>1827</TaxId>
        <ProteinCount>27</ProteinCount>
        <Accession>WP_010594839.1</Accession>
        <ModificationDate>2018/05/08 00:00</ModificationDate>
        <Slen>346</Slen>
        <Organism>Rhodococcus</Organism>
        <BlastName>high GC Gram+</BlastName>
        <Div>BCT</Div>
        <TaxIdCount>5</TaxIdCount>
        <SpeciesCount>2</SpeciesCount>
        <GeneraCount>1</GeneraCount>
        <AssemblyCount>14</AssemblyCount>
        <KingdomCount>1</KingdomCount>
        <GenomeCount>0</GenomeCount>
        <RowCount>37</RowCount>
</DocumentSummary>

</DocumentSummarySet>
ADD REPLYlink modified 4 months ago • written 4 months ago by genomax59k

Yup, i see the same thing. You are right, I am find the etuils hit and miss. I am resorting to webscraping the ipg page now. No other choice.

ADD REPLYlink written 4 months ago by ahir2910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1163 users visited in the last hour