Question: How can I get sequencing data from NCBI with UniProt taxonomy identifiers? Automating with an API
5.1 years ago by
United States
wewolf0 wrote:


I am interested in downloading complete genomes to build a phylogenetic tree. NCBI provides a toolkit called the Entrez Programming Utilities (E-utilities, or eutils for short). I found an excellent resource that walks through everything I need to know, complete with a Python script to automate downloading these genomes from NCBI.

I have an "interesting-genomes.txt" file listing the organisms I'd like to find complete genomes for. However, the IDs in this list are UniProt taxonomy identifiers.

For example, Streptococcus mitis bv. 2 str. SK95 has the taxonomy identifier 1000588 in UniProt. In NCBI, its ID is NC_013853.

My file contains a long list of taxonomy identifiers like 1000588, not NCBI IDs like NC_013853. Any ideas on how I can get around this?

Thank you!


The NCBI ID you provided looks like a contig (sequence) accession, not a taxID. The taxid for your organism of interest, Streptococcus mitis bv. 2 str. SK95, is 1000588 (taxid:1000588), which is the same ID that UniProt uses.

Reply written 2.7 years ago by theobroma221.1k
5.1 years ago by
onuralp190 wrote:

This is tricky because there are usually many assemblies or genomes available for a given taxon. If you try to map a taxon ID back to genomes using, say, Batch Entrez, you will end up retrieving a huge number of sequences associated with that taxon ID.
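To see how many records hang off a single taxon ID, something like the following rough Python sketch could help. It queries the E-utilities ESearch endpoint with the `txid...[Organism:exp]` search term. The function names `build_esearch_url` and `count_hits`, and the choice of the `nuccore` database, are my own assumptions for illustration, not something from the resource mentioned in the question.

```python
import json
import urllib.request
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(taxid, db="nuccore", retmax=20):
    """Build an ESearch URL for records in `db` linked to an NCBI taxonomy ID."""
    params = {
        "db": db,                              # assumed database; adjust as needed
        "term": f"txid{taxid}[Organism:exp]",  # taxon ID, including child taxa
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def count_hits(taxid, db="nuccore"):
    """Ask ESearch how many records match the taxon ID (needs network access)."""
    with urllib.request.urlopen(build_esearch_url(taxid, db, retmax=0)) as resp:
        return int(json.load(resp)["esearchresult"]["count"])
```

Calling `count_hits(1000588)` for each ID in your list should give a feel for how many sequences come back per taxon. If you loop over a long list, pause between requests so you stay within NCBI's usage limits.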

One possible workaround is to stick to representative genomes/assemblies, which should give you a one-to-one correspondence between taxon ID and genome. In principle, this should work for almost all cases in your list, except those that were sequenced very recently or have some odd strain-specific complications.
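As a rough illustration of the one-genome-per-taxon idea, here is a small Python sketch that keeps a single record per taxon ID and prefers records flagged as reference or representative. The record keys (`taxid`, `accession`, `category`) and the category labels are assumptions for the example; match them to whatever your metadata actually contains.

```python
# Assumed preference order: lower rank wins. Unknown categories rank last.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}


def pick_one_per_taxon(records):
    """Return {taxid: record}, keeping the best-ranked record for each taxid.

    Ties are resolved in favour of the record seen first.
    """
    best = {}
    for rec in records:
        rank = CATEGORY_RANK.get(rec["category"], len(CATEGORY_RANK))
        current = best.get(rec["taxid"])
        if current is None or rank < current[0]:
            best[rec["taxid"]] = (rank, rec)
    return {taxid: rec for taxid, (_, rec) in best.items()}
```

Taxa that have no reference or representative record still get one entry (the first one seen), so it is worth checking those by hand.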

Download the following file, which lists species names and RefSeq complete genome IDs:

Then you can write a simple script to parse this file and extract the corresponding accession number (e.g., NC_013853) for a given species (e.g., Streptococcus mitis). 
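A minimal Python sketch of such a script is below. It assumes the downloaded file is tab-delimited with a header row, and the column names `taxid` and `accession` are hypothetical placeholders, so change them to match the real file. Pass `key_column="organism_name"` (also hypothetical) to look up by species name instead.

```python
import csv


def load_accessions(path, key_column="taxid", accession_column="accession"):
    """Map each value in key_column to the list of accessions found for it.

    Column names are assumptions; adjust them to the file's actual header.
    """
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            mapping.setdefault(row[key_column], []).append(row[accession_column])
    return mapping


def lookup_ids(ids_path, mapping):
    """Read one ID per line and return {id: [accessions]} (empty list if none)."""
    with open(ids_path) as handle:
        ids = [line.strip() for line in handle if line.strip()]
    return {identifier: mapping.get(identifier, []) for identifier in ids}
```

For example, `lookup_ids("interesting-genomes.txt", load_accessions("genomes.tsv"))` would return the accessions found for each ID, where "genomes.tsv" stands in for whatever you named the download. IDs that map to an empty list or to more than one accession are the ones to check manually.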


I think the new genomes ftp has a representative directory for every species like  

Reply written 5.1 years ago by Chris S.290