Question

Retrieving Sequences using NCBI Gene database IDs

9.8 years ago

gavingray1729 • 0

I'm trying to automatically retrieve sequences for genes identified by NCBI Gene IDs. For example, the Gene database page for gene ID 114787 is: http://www.ncbi.nlm.nih.gov/gene/?term=114787%5Buid%5D

That page links to the nucleotide database, where I can get the sequences for this gene in FASTA format, which is what I want. However, I can't query the nucleotide database with Biopython through the EFetch service because the IDs are different. I've tried using the ELink service to map from Gene ID to nucleotide ID, but I just get a massive list of IDs back, which can't be right.

How should I be doing this for a large number of Entrez Gene IDs? Preferably with Biopython.

Sequence Biopython Gene • 6.7k views

Updated 2.5 years ago by Ram 43k • written 9.8 years ago by gavingray1729 • 0

Ram · Answer 1 · 2014-06-23

9.8 years ago

Pierre Lindenbaum 161k

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=gene shows that you can restrict the output of ELink to the RefSeq RNA sequences by adding linkname=gene_nuccore_refseqrna.

The query for NOTCH2 would be: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=nucleotide&id=4853&linkname=gene_nuccore_refseqrna
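Since the question asks for Biopython, the same ELink query followed by an EFetch call might look roughly like the sketch below. It is only an illustration under some assumptions: the function names `linked_ids` and `fetch_refseq_rna_fasta` are made up for this example, the e-mail address is something you supply yourself, and `db="nuccore"` is used on the assumption that it is the database the URL above addresses as `nucleotide`.

```python
def linked_ids(elink_records, linkname):
    """Collect the target UIDs for one linkname from parsed ELink output."""
    ids = []
    for linkset in elink_records:
        for linksetdb in linkset.get("LinkSetDb", []):
            if linksetdb.get("LinkName") == linkname:
                ids.extend(link["Id"] for link in linksetdb["Link"])
    return ids


def fetch_refseq_rna_fasta(gene_id, email):
    """Return FASTA text for the RefSeq RNAs linked to one Gene ID (needs network)."""
    # Imported here so that linked_ids() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    linkname = "gene_nuccore_refseqrna"

    # Step 1: map the Gene ID to nucleotide UIDs, restricted to RefSeq RNAs.
    handle = Entrez.elink(dbfrom="gene", db="nuccore", id=str(gene_id), linkname=linkname)
    records = Entrez.read(handle)
    handle.close()

    ids = linked_ids(records, linkname)
    if not ids:
        return ""  # no RefSeq RNA linked to this gene

    # Step 2: fetch all linked records as FASTA in a single request.
    handle = Entrez.efetch(db="nuccore", id=",".join(ids), rettype="fasta", retmode="text")
    fasta = handle.read()
    handle.close()
    return fasta
```

For a large list of Gene IDs, calling `fetch_refseq_rna_fasta` once per ID keeps the gene-to-sequence mapping explicit; a short pause between calls (for example `time.sleep(0.4)`) helps stay under NCBI's limit of three requests per second without an API key.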

Updated 2.5 years ago by Ram 43k • written 9.8 years ago by Pierre Lindenbaum 161k

Thanks. I was about to close this question after finding your answer to another one, "Get Fasta File With Protein Sequences Given Entrez Gene Ids", which turns out to be exactly what I wanted to do.

Updated 2.5 years ago by Ram 43k • written 9.8 years ago by gavingray1729 • 0

However, your script as written there fails to run. There's a problem with the tab delimiters (maybe they got reformatted when you pasted it in here?). I replaced them with $'\t', but then the script just hangs.

Having written my own code, I see that each Gene ID maps to multiple protein IDs. You say in the comments on the other post that I could select any of these protein IDs and it wouldn't matter. Do you mean that all the protein IDs that map to a single Gene ID return the same sequence?
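One way to check this directly, rather than assume it, is to download the FASTA records for all protein IDs linked to one gene and group the IDs by sequence. The sketch below is a minimal illustration that works on FASTA text already in hand; the function names `parse_fasta` and `group_ids_by_sequence` are made up for this example, and the IDs in the usage example are placeholders, not real accessions.

```python
from collections import defaultdict


def parse_fasta(fasta_text):
    """Map each record ID (first word of the header) to its full sequence."""
    records = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def group_ids_by_sequence(fasta_text):
    """Return lists of record IDs, one list per distinct sequence."""
    groups = defaultdict(list)
    for seq_id, seq in parse_fasta(fasta_text).items():
        groups[seq].append(seq_id)
    return list(groups.values())


# Placeholder records: P1 and P2 carry the same sequence, P3 differs.
example = ">P1 first\nMKV\nLL\n>P2 second\nMKVLL\n>P3 third\nMKV\n"
print(group_ids_by_sequence(example))
```

If every protein ID for a gene lands in a single group, picking any one of them gives the same sequence; more than one group means the IDs correspond to different sequences and the choice does matter.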

Updated 2.5 years ago by Ram 43k • written 9.8 years ago by gavingray1729 • 0