Python Programmatically obtaining the corresponding wildtype DNA sequence of a RefSeq Protein Identifier (e.g WP_016693432.1)
1
1
Entering edit mode
5.7 years ago
ahir29 ▴ 10

Dear All, The question is quite self explanatory - what is the easiest way (using python) to get the corresponding DNA sequence from which the translated protein sequence is available in the RefSeq NCBI database.

For example, I have a fasta file after a blast search -

>WP_010594839.1 MULTISPECIES: NAD(P)-dependent alcohol dehydrogenase [Rhodococcus]
MKALQYTEIGSEPVVVDVPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYIYGLPLTLGHEGVGTVAELGAGVTGFET 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH 
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TAVFDFVGAQSTIDTAQQVVAIDGHISVVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMDVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_024100401.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus pyridinivorans]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTVAELGDGVTGFET 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH 
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TVVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_016693432.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus rhodochrous]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTIAELGAGVTGFEK 
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAAPLTDAGLTPYH 
AISRVLPLLGPGSMAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA 
TAVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF 
TLDEGPTAYRRLREGSIRGRGVVVPG

I am using Biopython currently to try and map the refseq ID (e.g WP_010594839.1) and the Entrez.fetch to somehow obtain the corresponding DNA sequence. But, I am finding there is no clear way even non-programmatically to go and do something like this. The closest I can get is getting the whole genome of the file (by clicking the "nucleotide" or "genome" links) on the right side of the webpage of WP_010594839.1 (i.e. https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1) - (something I wouldnt know know how to do programmatically even if the whole genome was what I am after).

Can anyone please give me some advice on whether there is some super obvious way that I have totally missed, to get the corresponding coding DNA sequence of a protein given its reference number? Programmatically is preferred, but even manually is a good start for now!

Thank you

dna database mapping Refseq transcript • 2.0k views
ADD COMMENT
2
Entering edit mode
5.7 years ago
GenoMax 141k

By representing identical proteins using a single non-redundant protein accession number (with the prefix WP_), redundancy in the database is significantly reduced.

So WP numbers actually represent multiple proteins though only one is indicated in the header. More information here.

ADD COMMENT
0
Entering edit mode

Thanks genomax - I cant believe my vigilant googling didnt get me to that page explaining where all the transcripts of a reference seq is kept! Need to improve my searching skills:) I am assuming now using the Entrez etuils wrapper in Biopython can get me the necessary transcripts. I will take a look at doing it myself!

ADD REPLY
0
Entering edit mode

That may not be directly possible since WP accessions are not linked on backend with a single DNA sequence. I did try to see if efetch worked briefly (it did not).

ADD REPLY
0
Entering edit mode

Is there no way to use the Entrez utils wrapper in biopython to access the "identical proteins" link ? I.e for https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1 ; this would lead to "https://www.ncbi.nlm.nih.gov/ipg/?term=WP_010594839.1" ; where any of the entries there gets me to a page which has a DNA transcript which I am happy to use. How do I get to that particular entry without web scraping or using any other tool that python/biopython?

ADD REPLY
0
Entering edit mode

A lot of times I find eUtils to be in "oh so close but no cigar" category (Disc: I may be doing something wrong as well).

Something odd seems to be happening with querying ipg database and the retrieval format. You can test this and let me know if you see the same thing.

$ esearch -db ipg -query "WP_010594839.1" | efetch -format fasta
>CAA39212.1 extensin (class I) [Solanum lycopersicum]
PPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPSPSPPPPYYYKSPPPPDPSP
PPPYYYKSPPPPSPSPPPPSPSPPPPTYSSPPPPPPFYENIPLPPVIGVSYASPPPPVIPYY

$ esearch -db ipg -query "WP_010594839.1" | efetch -format docsum

https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20171128/esummary_ipg.dtd">

<DocumentSummarySet status="OK">
<DbBuild>Build180808-2026m.1</DbBuild>

<DocumentSummary><Id>1345531</Id>
        <Caption>WP_010594839</Caption>
        <Title>NAD(P)-dependent alcohol dehydrogenase</Title>
        <IPG>1345531</IPG>
        <TaxId>1827</TaxId>
        <ProteinCount>27</ProteinCount>
        <Accession>WP_010594839.1</Accession>
        <ModificationDate>2018/05/08 00:00</ModificationDate>
        <Slen>346</Slen>
        <Organism>Rhodococcus</Organism>
        <BlastName>high GC Gram+</BlastName>
        <Div>BCT</Div>
        <TaxIdCount>5</TaxIdCount>
        <SpeciesCount>2</SpeciesCount>
        <GeneraCount>1</GeneraCount>
        <AssemblyCount>14</AssemblyCount>
        <KingdomCount>1</KingdomCount>
        <GenomeCount>0</GenomeCount>
        <RowCount>37</RowCount>
</DocumentSummary>

</DocumentSummarySet>
ADD REPLY
0
Entering edit mode

Yup, i see the same thing. You are right, I am find the etuils hit and miss. I am resorting to webscraping the ipg page now. No other choice.

ADD REPLY

Login before adding your answer.

Traffic: 1849 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6