Question: Obtaining NCBI GI Numbers from Taxonomy ID (for Entrez efetch query)
patroos (United States) wrote 2.4 years ago [3 votes]:

I'm trying to automatically obtain FASTA files from the NCBI nucleotide database for a list of taxonomy IDs. I know I can use Entrez's efetch, but it expects GI numbers, which I do not have. Is there a way to fetch by taxonomy ID, or a straightforward, non-manual way to get GI numbers from taxonomy IDs?

Thanks much for any help!

Tags: entrez, nucleotide, ncbi
5heikki (Finland) wrote 2.4 years ago [6 votes]:

This is really simple with Entrez Direct:

epost -db taxonomy -id 63221 | elink -target nuccore | efetch -format uid
196123578
677001457
634744538
634744524
2286205
584458899
513134556
398637208
315623200
270209679
270209678
270209677
262527002
253947345
253947331
253947317
253947303
253947289
222350099
222350097
195972535
158958247
158251955
158251954
111035029
91075865
28557455
11141613
11141612
7769684
4927255

This also works with -target nucgss, if that is what you are interested in. You can also skip the GI step entirely and end the pipeline with efetch -format fasta.
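For a whole list of taxonomy IDs, the same pipeline could be wrapped in a loop. What follows is only a sketch: it assumes Entrez Direct (epost, elink, efetch) is installed and on the PATH, and the function name fetch_fasta_for_taxids, the input file taxids.txt (one taxonomy ID per line) and the output directory fasta_out are made-up names for illustration.

```shell
# Sketch: write one FASTA file per taxonomy ID listed in a text file.
# Assumes Entrez Direct (epost, elink, efetch) is on the PATH.
# fetch_fasta_for_taxids, taxids.txt and fasta_out are hypothetical names.

fetch_fasta_for_taxids() {
    local taxid_file="$1"   # input: one taxonomy ID per line
    local out_dir="$2"      # output: <taxid>.fasta files are written here
    local taxid

    mkdir -p "$out_dir"
    while IFS= read -r taxid || [ -n "$taxid" ]; do
        # Skip blank lines in the input list
        [ -z "$taxid" ] && continue
        # epost gets /dev/null as stdin so it cannot swallow the
        # remaining lines of the ID list that the loop is reading
        epost -db taxonomy -id "$taxid" < /dev/null \
            | elink -target nuccore \
            | efetch -format fasta > "$out_dir/$taxid.fasta"
    done < "$taxid_file"
}

# Example call (needs network access):
# fetch_fasta_for_taxids taxids.txt fasta_out
```

Writing one file per taxonomy ID keeps the sequences grouped by taxon; swapping -format fasta for -format uid would give the GI lists instead.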

Comment by patroos, 2.4 years ago [1 vote]:

Fantastic, this is very useful. Thanks!
patroos (United States) wrote 2.4 years ago [3 votes]:

I haven't found a great solution for this, but the workaround I am using now is to download /pub/taxonomy/gi_taxid_nucl.dmp.gz from the NCBI FTP server. The file maps GI numbers to taxonomy IDs, and I search it to get the GI numbers for a given taxonomy ID.
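The search itself can be done in a single pass with awk. A minimal sketch, assuming the decompressed dump has two tab-separated columns (GI number first, taxonomy ID second); the function name gis_for_taxid is made up for illustration.

```shell
# Sketch: print every GI number mapped to one taxonomy ID.
# Assumes each line of the dump is "<gi><TAB><taxid>".
# gis_for_taxid is a hypothetical helper name.

gis_for_taxid() {
    local taxid="$1"   # taxonomy ID to look up
    local dump="$2"    # path to the decompressed dump, or - for stdin
    # Print column 1 (GI) wherever column 2 (taxid) matches
    awk -F '\t' -v t="$taxid" '$2 == t { print $1 }' "$dump"
}

# Example: stream the compressed file without unpacking it to disk
# gzip -dc gi_taxid_nucl.dmp.gz | gis_for_taxid 63221 -
```

This matches only records assigned to exactly that taxonomy ID; sequences assigned to child taxa are listed under their own IDs and would need separate lookups.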

David W (New Zealand) wrote 2.4 years ago [2 votes]:

The elink util will turn up cross-references between NCBI databases.

In this case you want to find links between the taxonomy and nucleotide databases. Here's a demo using the R package rentrez (there are similar libraries for pretty much all popular scripting languages, and even command-line utilities for this):

 

Find sequences linked to a taxid:

library(rentrez)
tax_seqs <- entrez_link(db = "nuccore", dbfrom = "taxonomy", id = 5911)
#elink result with ids from 2 databases:
#[1] taxonomy_nuccore        taxonomy_nucleotide_exp

Then fetch the first three records as FASTA and read them with ape:

tmp <- tempfile()
recs <- entrez_fetch(db="nuccore", id=tax_seqs$taxonomy_nuccore[1:3], rettype="fasta")
cat(recs, file=tmp)
ape::read.dna(tmp, format="fasta")
#3 DNA sequences in binary format stored in a list.
#
# Mean sequence length: 1493
# Shortest sequence: 1137
# Longest sequence: 1779
#
# Labels:
# gi|697738807|gb|KM406498.1| Tetrahymena thermophila Pat2 mRNA, complete cds
# gi|697738801|gb|KM406497.1| Tetrahymena thermophila Tpt1 mRNA, complete cds
# gi|697738796|gb|KM406496.1| Tetrahymena thermophila Pat1 mRNA, complete cds

 

natasha.sernova wrote 2.2 years ago [0 votes]:

It's much simpler: see http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook.

Good luck!
