Question: Retrieving Sequences using NCBI Gene database IDs
gavingray1729 (United Kingdom) wrote 5.2 years ago:

I'm trying to automatically retrieve sequences for genes identified by NCBI Gene IDs. An example Gene ID is 114787; its page in the Gene database is: http://www.ncbi.nlm.nih.gov/gene/?term=114787%5Buid%5D

That page links to the nucleotide database, where the sequences for this gene are available in FASTA format, which is what I want. However, I can't query the nucleotide database with Biopython through the EFetch service, because the nucleotide IDs differ from the Gene IDs. I've tried using the ELink service to map from Gene ID to nucleotide ID, but I just get a massive list of IDs back, which can't be right.

How should I do this for a large number of Entrez Gene IDs? Preferably with Biopython.

Tags: biopython, sequence, gene
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 5.2 years ago:

The EInfo output for the gene database, http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=gene, lists the available link names. It shows that you can restrict the output of ELink to the RefSeq RNA sequences with `linkname=gene_nuccore_refseqrna`.
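
To see which link names are available, the EInfo reply can be parsed directly. The following is a minimal sketch using only the Python standard library. The helper name `link_names` is hypothetical, and the code assumes the reply uses the `DbInfo/LinkList/Link` layout with `Name` and `DbTo` children.

```python
import xml.etree.ElementTree as ET


def link_names(einfo_xml, target_db=None):
    """Return (link name, target database) pairs listed in an EInfo XML reply.

    Assumes the reply has the layout eInfoResult/DbInfo/LinkList/Link,
    where each Link element carries Name and DbTo children.
    """
    root = ET.fromstring(einfo_xml)
    pairs = []
    for link in root.findall("./DbInfo/LinkList/Link"):
        name = link.findtext("Name")
        db_to = link.findtext("DbTo")
        # Keep every link, or only those pointing at the requested database.
        if target_db is None or db_to == target_db:
            pairs.append((name, db_to))
    return pairs
```

Passing the XML text downloaded from the EInfo URL above with `target_db="nuccore"` should list only the links that point at the nucleotide database.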

 

The query for NOTCH2 (Gene ID 4853) would be: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=nucleotide&id=4853&linkname=gene_nuccore_refseqrna
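
The ELink query in this answer can be chained with EFetch to get FASTA text. Below is a minimal sketch with the Python standard library, not a definitive implementation. All function names are hypothetical, the `https` base URL is used in place of the `http` one shown in the answer, and the parser assumes the ELink reply nests IDs as `LinkSet/LinkSetDb/Link/Id`.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(gene_id, linkname="gene_nuccore_refseqrna"):
    """Build the ELink URL mapping one Gene ID to linked nucleotide IDs."""
    params = {
        "dbfrom": "gene",
        "db": "nucleotide",
        "id": str(gene_id),
        "linkname": linkname,
    }
    return EUTILS + "/elink.fcgi?" + urllib.parse.urlencode(params)


def parse_link_ids(elink_xml):
    """Return the linked IDs found in an ELink XML reply (assumed layout)."""
    root = ET.fromstring(elink_xml)
    return [node.text for node in root.findall("./LinkSet/LinkSetDb/Link/Id")]


def efetch_fasta_url(nucleotide_ids):
    """Build the EFetch URL returning the given nucleotide IDs as FASTA text."""
    params = {
        "db": "nucleotide",
        "id": ",".join(nucleotide_ids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def download(url):
    """Fetch a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def fasta_for_gene(gene_id):
    """Return FASTA text for the RefSeq RNAs linked to one Gene ID."""
    ids = parse_link_ids(download(elink_url(gene_id)))
    # An empty link list means there is nothing to fetch for this gene.
    return download(efetch_fasta_url(ids)) if ids else ""
```

Calling `fasta_for_gene(4853)` would issue the NOTCH2 query above and then fetch the linked records. For a large number of Gene IDs, looping over them one at a time with a short pause between requests keeps each result tied to its own Gene ID and stays within NCBI's request rate limits.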

Thanks. I was about to close this question after finding your answer to another one, "Get Fasta File With Protein Sequences Given Entrez Gene Ids", which turns out to be exactly what I wanted to do.

Reply written 5.2 years ago by gavingray1729

However, your script as written there fails to run. There is a problem with the tab delimiters (perhaps they were reformatted when you pasted it in here?). I replaced them with $'\t', but the script just hangs.

Writing my own code, it looks like each Gene ID maps to multiple protein IDs. In the comments on the other post you say that I could select any of these protein IDs and it wouldn't matter. Do you mean that all the protein IDs that map to a single Gene ID return the same sequence?

Reply written 5.2 years ago by gavingray1729
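
One way to check whether the protein IDs linked to a Gene ID share a sequence is to fetch them and group identical sequences. The following Biopython sketch is an illustration under stated assumptions, not the script from the other post. The function names are hypothetical, the link name `gene_protein_refseq` is an assumption that should be confirmed against the EInfo output, and Biopython is imported inside the network function so the two pure helpers work without it.

```python
from collections import defaultdict


def extract_link_ids(elink_records):
    """Collect linked IDs from the structure Entrez.read() gives for ELink."""
    ids = []
    for linkset in elink_records:
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb["Link"])
    return ids


def group_by_sequence(named_sequences):
    """Map each distinct sequence string to the record IDs that share it."""
    groups = defaultdict(list)
    for record_id, sequence in named_sequences:
        groups[str(sequence)].append(record_id)
    return dict(groups)


def protein_groups_for_gene(gene_id, email):
    """Fetch proteins linked to one Gene ID and group identical sequences.

    Needs network access and Biopython.
    """
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks callers to supply a contact address
    handle = Entrez.elink(
        dbfrom="gene",
        db="protein",
        id=str(gene_id),
        linkname="gene_protein_refseq",  # assumed link name, check EInfo
    )
    protein_ids = extract_link_ids(Entrez.read(handle))
    handle.close()
    if not protein_ids:
        return {}
    handle = Entrez.efetch(
        db="protein", id=",".join(protein_ids), rettype="fasta", retmode="text"
    )
    records = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    return group_by_sequence((rec.id, rec.seq) for rec in records)
```

If `protein_groups_for_gene` returns more than one key for a Gene ID, the linked protein IDs do not all carry the same sequence, and picking one arbitrarily would matter.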