Question: Ways To Programmatically Determine Whole Genome Completion Dates, Genomic Type, Given A Taxonomic Identifier?
Bliptrip (Milwaukee, WI) wrote, 6.9 years ago:

I have been tasked with computing statistics on a dataset of microbial protein sequences, most importantly the average number of proteins per genome (think 'strain' or 'isolate') at a given taxonomic level. Because some proteins in our dataset come from incomplete genomes, I decided to consider only 'complete' genomes to keep the statistics as accurate as possible. I also need the creation or submission date of each completed genome, because the protein set I'm working with comes from Pfam 25.0, which was built from a May 2010 snapshot of UniProtKB. Hence, I have to disregard any completed genomes created after this date.

What I did to handle this, right or wrong, was use the NCBI E-utilities elink interface to map taxonomic IDs to genome IDs, and then the esummary interface to pull the metadata for each genome ID. If the creation date fell on or before the cutoff and the base-pair count exceeded a given threshold (the only way I knew to distinguish a chromosome from a plasmid), I classified the genome as 'complete' and counted all protein sequences in that strain as valid. Otherwise, I threw out the sequences as coming from an 'incomplete' genome.
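The filtering rule described above can be sketched in a few lines of Ruby. This is only an illustration of the logic, not the code actually used: the cutoff day, the size threshold, the method name `complete_genome?` and the sample records are all assumptions, since the post gives no concrete values.

```ruby
require 'date'

# Assumed values: the post names neither an exact cutoff day nor a size threshold.
CUTOFF_DATE  = Date.new(2010, 5, 31) # assumed end of the May 2010 UniProtKB snapshot
MIN_CHROM_BP = 500_000               # assumed minimum length for a chromosome

# True when a genome record passes both the date test and the size test.
def complete_genome?(create_date, length_bp, cutoff: CUTOFF_DATE, min_bp: MIN_CHROM_BP)
  create_date <= cutoff && length_bp >= min_bp
end

# Made-up records standing in for parsed esummary output.
records = [
  { id: 'g1', create_date: Date.new(2009, 3, 14), length_bp: 4_600_000 },
  { id: 'g2', create_date: Date.new(2011, 1, 2),  length_bp: 5_100_000 }, # after the cutoff
  { id: 'g3', create_date: Date.new(2008, 7, 9),  length_bp: 90_000 }     # plasmid-sized
]

kept = records.select { |r| complete_genome?(r[:create_date], r[:length_bp]) }
puts kept.map { |r| r[:id] }.inspect # prints ["g1"]
```

A pure size threshold will misclassify large plasmids and small chromosomes, which is part of why the question below asks for a more reliable signal.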

Recently, however, NCBI overhauled their genome database (without warning, from what I can tell). It no longer maps 'strains' to genome identifiers directly, only a species to a genome identifier. Moreover, E-utilities access to the genome database has been limited to esummary, and the esummary record contains very little that is useful on a per-strain basis. Interestingly, the web interface to the genome database provides a much richer view, including a breakdown of genomes by 'strain', the genome type (chromosomal vs. plasmid), and links to the nucleotide database sequences (RefSeq and/or INSDC accessions). For reasons beyond my understanding (unless I'm missing something), NCBI does not provide this same information in the esummary output.

I emailed the NCBI help desk asking how they recommended handling this, but received only a canned answer pointing me to bulletins announcing the genome database overhaul and the fields to expect in the esummary output. The email approach was of very little value.

I then called NCBI and reached someone who seemed to have some technical knowledge of the databases and interfaces. He told me that my approach under the old database format was wrong, regardless of the workflow issues caused by the new database structure. As far as I could tell, he was saying that there really is no good way to tell whether a genome is 'complete'. I asked whether there was another way to do this, and whether he could recommend an external database (TIGR, GOLD, KEGG), but got no help on either point. Essentially, I was told there is no reasonable way to accomplish this.

Thus, my questions, in order, are:

1) Is there a way to programmatically determine if a bacterial genome is essentially 'complete' before a given date?

2) If so, which single-point genome database would be the most comprehensive in providing this information?

3) If there are suggestions for ways to do this in NCBI, the only way I know of is using a "taxonomy" to "nuccore" link. If this is an option:

   a) How do I distinguish nucleotide entries that represent 'complete' genomes from those that are not 'complete', in a consistent way? I can't seem to find any NCBI annotation guidelines that guarantee that all GenBank entries will be annotated indicating a complete genome (although some entries have the text 'complete genome' in their definition, others have 'complete sequence').

   b) How do I distinguish chromosomal DNA from plasmid DNA in a consistent way? I can't seem to find any NCBI annotation guidelines that guarantee that all GenBank entries will be marked 'plasmid' where applicable.
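To make question 3 concrete, here is a minimal Ruby sketch of the title-based heuristic it is asking about. The keyword list, the method name `classify_definition` and the sample definition lines are assumptions made for illustration. Nothing here reflects an NCBI annotation rule, which is exactly the gap the question points out.

```ruby
# Heuristic sketch: guess plasmid status and completeness from a definition line.
# The phrases below are assumptions, not guaranteed NCBI annotation conventions.
COMPLETE_PHRASES = ['complete genome', 'complete sequence'].freeze

def classify_definition(definition)
  text = definition.downcase
  {
    plasmid:  text.include?('plasmid'),
    complete: COMPLETE_PHRASES.any? { |phrase| text.include?(phrase) }
  }
end

# Made-up definition lines, for illustration only.
examples = [
  'Examplebacter sp. strain X1, complete genome',
  'Examplebacter sp. strain X1 plasmid pEX1, complete sequence',
  'Examplebacter sp. strain X2 contig 17, whole genome shotgun sequence'
]

# Keep entries that look complete and are not flagged as plasmids.
chromosomes = examples.select do |line|
  result = classify_definition(line)
  result[:complete] && !result[:plasmid]
end

puts chromosomes
```

Any entry whose definition line omits these phrases will be silently missed, so a filter like this can only give a lower bound.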

Thank you,


Tags: ncbi, genome, eutils
Neilfws (Sydney, Australia) wrote, 6.9 years ago:

I would second James' suggestion and also dig around in the NCBI FTP site: see for example, the file lproks_0.txt.

I've made a public list of Entrez databases and the fields on which they can be searched. For the nucleotide database, I think that the best you can do is search for: "complete genome"[TITL] AND "bacteria"[ORGN] NOT plasmid[TITL].

As you said, there is no guarantee that those terms are in any way standard components of the DEFINITION line, but it's probably the best you can do. Here's my result, using the BioRuby implementation of EUtils:

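require 'rubygems'
require 'bio'

# NCBI asks E-utilities clients to identify themselves; set your own address here.
Bio::NCBI.default_email = ""
ncbi   = Bio::NCBI::REST.new
# Count nucleotide records titled "complete genome" from bacteria, excluding plasmids.
search = ncbi.esearch_count("complete genome[TITL] AND bacteria[ORGN] NOT plasmid[TITL]",
                            {"db" => "nucleotide"})
# => 3192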
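
For anyone without BioRuby installed, the same count query can be written as a plain E-utilities URL. The sketch below only builds the URL and does not contact NCBI. The helper name `esearch_count_url` is made up, and the optional date window is an assumption: it uses the ESearch `datetype`, `mindate` and `maxdate` parameters with the publication date (`pdat`) as a stand-in for "completed before May 2010", which may not match when a genome was actually finished.

```ruby
require 'uri'

EUTILS_ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Build an ESearch URL that asks only for the hit count.
# maxdate (YYYY/MM/DD) is optional; the 1900 lower bound is an arbitrary placeholder.
def esearch_count_url(term, db: 'nucleotide', maxdate: nil)
  params = { 'db' => db, 'term' => term, 'rettype' => 'count' }
  if maxdate
    params.merge!('datetype' => 'pdat', 'mindate' => '1900/01/01', 'maxdate' => maxdate)
  end
  "#{EUTILS_ESEARCH}?#{URI.encode_www_form(params)}"
end

term = '"complete genome"[TITL] AND bacteria[ORGN] NOT plasmid[TITL]'
url  = esearch_count_url(term, maxdate: '2010/05/31')
puts url
```

Fetching that URL returns a small XML document containing the count. The count will differ from the figure above because the database has grown since.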

Thanks for the hint on using the NCBI ftp site. Nearly all the information I needed was in the lproks_1.txt. My only wish is that NCBI provided an archive (or revision history) of these, as this would give me an idea of the history when the genomes were marked 'complete'.
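
Filtering such a table by release date could look roughly like the sketch below. It assumes a tab-delimited file with a header row. The column names (`taxid`, `organism`, `released`), the date format and the sample rows are placeholders rather than the real layout of lproks_1.txt, so they would need to be adapted to the actual header.

```ruby
require 'date'

# Made-up sample standing in for a tab-delimited genome table.
# Column names and date format are placeholders, not the real lproks layout.
SAMPLE = <<~TSV
  taxid\torganism\treleased
  1001\tExample organism A\t2009-06-01
  1002\tExample organism B\t2010-11-15
TSV

# Return the rows (as hashes keyed by header name) released on or before the cutoff.
def released_by(tsv, cutoff)
  header, *rows = tsv.lines.map { |line| line.chomp.split("\t") }
  rows.map { |row| header.zip(row).to_h }
      .select { |rec| Date.parse(rec['released']) <= cutoff }
end

early = released_by(SAMPLE, Date.new(2010, 5, 31))
puts early.map { |rec| rec['taxid'] }.inspect # prints ["1001"]
```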

Yeah, given the apparent lack (as far as I can tell) of guidelines for annotating complete genomes, I think the best I can do is rely on the complete microbial genome project page.

Reply written 6.9 years ago by Bliptrip
James Estevez (Tacoma, WA) wrote, 6.9 years ago:

Regarding (1), my first guess would be to try the complete microbial genome project page, which contains the release date. Obviously not eutils, but it's a (kludgy) start.

Yakov Pechersky wrote, 6.8 years ago:

You can use "complete[Status]" in the Genome database.
