Dear all, the question is fairly self-explanatory: what is the easiest way (using Python) to get the DNA sequence that encodes a protein whose translated sequence is available in the NCBI RefSeq database?
For example, I have a FASTA file from a BLAST search:
>WP_010594839.1 MULTISPECIES: NAD(P)-dependent alcohol dehydrogenase [Rhodococcus]
MKALQYTEIGSEPVVVDVPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYIYGLPLTLGHEGVGTVAELGAGVTGFET
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA
TAVFDFVGAQSTIDTAQQVVAIDGHISVVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMDVVDLARAGRLDIHTETF
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_024100401.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus pyridinivorans]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTVAELGDGVTGFET
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAVPLTDAGLTPYH
AISRVLPLLGPGSTAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA
TVVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF
TLDEGPTAYRRLREGSIRGRGVVVPG
>WP_016693432.1 NAD(P)-dependent alcohol dehydrogenase [Rhodococcus rhodochrous]
MRALQYTEIGSEPVVVDLPTPAPGPGEILLKVTAAGLCHSDIFVMDMPAEQYAYGLPLTLGHEGVGTIAELGAGVTGFEK
GDAVAVYGPWGCGACHACARGRENYCTRAAELGITPPGLGSPGSMAEYMIVDSARHLVPIGDLDPVAAAPLTDAGLTPYH
AISRVLPLLGPGSMAVVIGVGGLGHVGIQILRAVSAARVIAVDLDDDRLALAREVGADAAVKSGAGAADAIRELTGGEGA
TAVFDFVGAQSTIDMAQQVVAIDGHISIVGIHAGAHAKVGFFMIPFGASVVTPYWGTRSELMEVVDLARAGRLDIHTETF
TLDEGPTAYRRLREGSIRGRGVVVPG
I am currently using Biopython to try to map each RefSeq ID (e.g. WP_010594839.1) to its corresponding DNA sequence with Entrez.efetch. However, I cannot find a clear way to do this, even non-programmatically. The closest I can get is retrieving the whole genome by clicking the "nucleotide" or "genome" links on the right side of the protein's page (i.e. https://www.ncbi.nlm.nih.gov/protein/WP_010594839.1), which I would not know how to do programmatically even if the whole genome were what I was after.
Can anyone please advise whether there is some obvious way, which I have totally missed, to get the coding DNA sequence of a protein given its RefSeq accession? A programmatic solution is preferred, but even a manual one would be a good start for now!
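One route that may be worth trying, offered as a sketch under assumptions rather than a confirmed answer: E-utilities efetch appears to accept rettype "ipg" for the protein database, which returns an Identical Protein Groups report listing the nucleotide records and coordinates where the protein is encoded. Those coordinates can then be passed to a second efetch call against nuccore using the documented seq_start, seq_stop and strand parameters. The function names below are hypothetical helpers, the IPG column names ("Nucleotide Accession", "Start", "Stop", "Strand") are assumed from the tab-delimited report and should be checked against real output, and nothing here touches the network until the URLs are actually opened.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def ipg_url(protein_acc):
    """Build the efetch URL for the Identical Protein Groups report."""
    params = {"db": "protein", "id": protein_acc,
              "rettype": "ipg", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_ipg(text):
    """Parse a tab-delimited IPG report into a list of dicts.

    Assumes the first non-empty line is a header row; column names are
    taken from that header, not hard-coded here.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def cds_url(row):
    """Build the efetch URL for the nucleotide region described by one row.

    Assumed column names: "Nucleotide Accession", "Start", "Stop", "Strand".
    efetch uses strand=1 for plus and strand=2 for minus.
    """
    strand = "2" if row["Strand"] == "-" else "1"
    params = {"db": "nuccore", "id": row["Nucleotide Accession"],
              "rettype": "fasta", "retmode": "text",
              "seq_start": row["Start"], "seq_stop": row["Stop"],
              "strand": strand}
    return EFETCH + "?" + urlencode(params)


# Usage (needs network access, so left as comments):
# from urllib.request import urlopen
# report = urlopen(ipg_url("WP_010594839.1")).read().decode()
# rows = parse_ipg(report)
# fasta = urlopen(cds_url(rows[0])).read().decode()
```

Since WP_ accessions are non-redundant records that can be shared across several genomes, the report may contain many rows, so picking rows[0] is arbitrary and the row for the organism of interest may need to be selected explicitly. The same parameters should also be usable as keyword arguments to Biopython's Entrez.efetch.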