Question: Retrieving Sequences using NCBI Gene database IDs
6.0 years ago by
United Kingdom
gavingray17290 wrote:

I'm trying to automatically retrieve sequences for genes defined by NCBI Gene identifiers. Example gene ID: 114787. Page on the Gene database site for this is:


There are links on that page to the Nucleotide database where I can get the sequences for this gene in FASTA format, which is what I want. However, I can't query the Nucleotide database with Biopython through the EFetch service because the IDs are different. I've tried using the ELink service to map from Gene ID to nucleotide ID, but I just get a massive list of IDs back, which can't be right.


How should I be doing this for a large number of Entrez Gene IDs? Preferably with Biopython.

6.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

You can restrict the output of ELink to the RefSeq RNA sequences with `linkname=gene_nuccore_refseqrna`.


The query for NOTCH2 would be:
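For the Biopython route asked about in the question, the same ELink restriction could be sketched roughly as below. This is only an illustration, not the original query: the helper names (`extract_linked_ids`, `refseq_rna_ids_for_gene`, `fetch_fasta`) and the e-mail address are made up for the example, and it assumes Biopython is installed and that `Entrez.read` returns the usual ELink structure (`LinkSetDb` entries containing `LinkName` and `Link`/`Id`). It makes one ELink call per Gene ID, so for a large list you would loop over the IDs and stay within NCBI's request rate limits.

```python
def extract_linked_ids(linksets, linkname="gene_nuccore_refseqrna"):
    """Collect the linked IDs for one link name from a parsed ELink result."""
    ids = []
    for linkset in linksets:
        for linksetdb in linkset.get("LinkSetDb", []):
            # Keep only the links of the requested type (RefSeq RNA by default).
            if linksetdb.get("LinkName") == linkname:
                ids.extend(str(link["Id"]) for link in linksetdb.get("Link", []))
    return ids


def refseq_rna_ids_for_gene(gene_id, email):
    """Map one Entrez Gene ID to its RefSeq RNA IDs in the nuccore database."""
    # Imported here so the pure helper above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.elink(
        dbfrom="gene",
        db="nuccore",
        id=str(gene_id),
        linkname="gene_nuccore_refseqrna",
    )
    try:
        linksets = Entrez.read(handle)
    finally:
        handle.close()
    return extract_linked_ids(linksets)


def fetch_fasta(nuccore_ids, email):
    """Download the given nuccore records as one FASTA-formatted string."""
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.efetch(
        db="nuccore",
        id=",".join(nuccore_ids),
        rettype="fasta",
        retmode="text",
    )
    try:
        return handle.read()
    finally:
        handle.close()


# Example use (needs network access; the address is a placeholder):
# ids = refseq_rna_ids_for_gene(114787, "you@example.org")
# if ids:
#     print(fetch_fasta(ids, "you@example.org"))
```

A gene with several transcripts will still return several RefSeq RNA IDs, so the FASTA output may contain more than one record per Gene ID.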



written 6.0 years ago by Pierre Lindenbaum129k

Thanks. I was about to close this question after finding your answer to another one, "Get Fasta File With Protein Sequences Given Entrez Gene Ids", which turns out to be exactly what I wanted to do.

written 6.0 years ago by gavingray17290

However, your script as written there fails to run. There's a problem with the tab delimiters (maybe they got reformatted when you pasted it in here?). I replaced them with $'\t', but the script just hangs.

Writing my own code, it looks like each Gene ID maps to multiple protein IDs. You say in the comments on the other post that I could just select any of these protein IDs and it wouldn't matter. Do you mean that all the protein IDs that map to a single Gene ID will return the same sequence?

written 6.0 years ago by gavingray17290
