I'm trying to automatically retrieve sequences for genes defined by NCBI Gene identifiers. An example Gene ID is 114787; its page in the Gene database is: http://www.ncbi.nlm.nih.gov/gene/?term=114787%5Buid%5D
There are links on that page to the nucleotide database, where I can get sequences for this gene in FASTA format, which is what I want. However, I can't query the nucleotide database directly with Biopython through the EFetch service, because nucleotide IDs are different from Gene IDs. I've tried using the ELink service to map from a Gene ID to nucleotide IDs, but I just get a massive list of IDs back, which can't be right.
How should I be doing this for a large number of Entrez Gene IDs? Preferably with Biopython.
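For reference, a minimal sketch of the ELink-then-EFetch approach described above. It assumes that the "massive list" comes from an unrestricted gene-to-nucleotide link, and that restricting ELink to a single link name narrows it down. The link name `gene_nuccore_refseqrna` (RefSeq RNA records) is an assumption about which links are wanted. The function names `linked_ids` and `fetch_gene_fasta` are hypothetical, and passing a list as `id` relies on recent Biopython versions sending one `id` parameter per Gene ID, so that each gene gets its own link set.

```python
def linked_ids(linksets, linkname):
    """Map each source ID to its linked target IDs for one link name.

    `linksets` has the shape that Bio.Entrez.read() returns for an ELink
    call: a list of dicts with "IdList" and "LinkSetDb" keys.
    """
    mapping = {}
    for linkset in linksets:
        targets = [
            str(link["Id"])
            for linkdb in linkset.get("LinkSetDb", [])
            if linkdb["LinkName"] == linkname
            for link in linkdb["Link"]
        ]
        for source_id in linkset["IdList"]:
            mapping[str(source_id)] = targets
    return mapping


def fetch_gene_fasta(gene_ids, email, linkname="gene_nuccore_refseqrna"):
    """Return {gene_id: FASTA text} for the nucleotide records linked to each gene."""
    # Imported here so that linked_ids() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    # Assumption: a list of IDs is sent as separate id= parameters,
    # giving one link set per Gene ID.
    handle = Entrez.elink(dbfrom="gene", db="nuccore",
                          id=list(gene_ids), linkname=linkname)
    linksets = Entrez.read(handle)
    handle.close()

    fasta = {}
    for gene_id, nuc_ids in linked_ids(linksets, linkname).items():
        if not nuc_ids:
            continue  # no records under this link name for this gene
        handle = Entrez.efetch(db="nuccore", id=",".join(nuc_ids),
                               rettype="fasta", retmode="text")
        fasta[gene_id] = handle.read()
        handle.close()
    return fasta
```

Calling `fetch_gene_fasta(["114787"], email="you@example.org")` would be the intended usage (the address is a placeholder). For a large number of Gene IDs it would probably make sense to send them in batches and stay within NCBI's request rate limits.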
Thanks. I was about to close this question after I found your answer to another question, "Get Fasta File With Protein Sequences Given Entrez Gene Ids", which turns out to be exactly what I wanted to do.
However, your script as written there fails to run. There's a problem with the tab delimiters (maybe they got reformatted when you pasted it in here?). I replaced them with $'\t', but the script just hangs.
From writing my own code, it looks like each Gene ID maps to multiple protein IDs. You say in the comments on the other post that I could just select any of these protein IDs and it wouldn't matter. Do you mean that all the protein IDs that map to a single Gene ID will return the same sequence?
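One way to check this empirically, as a rough sketch: fetch the FASTA records for all protein IDs linked to one gene, then group the record IDs by sequence. The helper below works on plain FASTA text only. Its name, `group_by_sequence`, is hypothetical, and it assumes the record ID is the first whitespace-delimited token of each header line.

```python
from collections import defaultdict


def group_by_sequence(fasta_text):
    """Map each distinct sequence to the list of record IDs that carry it."""
    groups = defaultdict(list)
    record_id, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                groups["".join(chunks)].append(record_id)
            tokens = line[1:].split()
            record_id, chunks = (tokens[0] if tokens else ""), []
        elif line and record_id is not None:
            chunks.append(line)
    # Close the final record.
    if record_id is not None:
        groups["".join(chunks)].append(record_id)
    return dict(groups)
```

If the result has a single key, every protein ID for that gene returned the same sequence; more than one key means the sequences differ, in which case picking an arbitrary ID would matter.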