Question: Number of genomes available in NCBI genome db
17 months ago by
emartin.morgan0 wrote:

I am trying to find the total number of genomes available in the NCBI Genome database using E-utilities. Browsing by organism on the website shows 39,625 genomes, and I'd like to pull that same number with E-utilities. I ran the following command in the Mac Terminal, but it returned a count of only 6314:

$ curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=overview&rettype=count"

Any ideas how I can edit that code to return the full list of genome IDs?

17 months ago by
genomax80k wrote:

Take a look at the output of:

$ einfo -db genome 

$ einfo -db genome | xtract -pattern DbInfo -element Name -element TermCount
ALL UID FILT    ORGN    PID PRJA    PRJT    DFLN    DSCR    STAT    AID AACC    ANAM    GI  ACCN    RNAM    PACC    PROT    PGI GNID    GENE    LTAG    WGSP    PMID    BIOP    PCID    PROP    CDT STRN    HOST    genome_assembly genome_bioproject   genome_gene genome_nuccore  genome_nuccore_samespecies  genome_protein  genome_proteinclusters  genome_pubmed   genome_taxonomy 
5674172 0   18  393432  81030   81015   21  40408   27281   4   191288  191288  196002  395879  872215  10882   00  0   0   3027959 0   314634  28232   0   61  9   5807    161224  34

You can parse the following two files to get all sorts of information.
Assembly summary file for GenBank can be found here.
Similar file for RefSeq genomes is here.

NCBI's Genome reports: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/
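
As a rough sketch of what "parsing" an assembly summary file could look like, the function below tallies assemblies by assembly level with awk. It assumes a locally downloaded, tab-separated copy of the file (the name `assembly_summary_genbank.txt` in the usage line is a placeholder), that comment/header lines start with `#`, and that column 12 holds the assembly level; check the header of your copy before trusting the column number.

```shell
# Tally assemblies by assembly level in an NCBI assembly summary file.
# Assumptions (verify against your copy): tab-separated, header/comment
# lines start with '#', and column 12 is the assembly level.
count_by_level() {
    awk -F '\t' '
        /^#/ { next }                 # skip header/comment lines
        { n[$12]++ }                  # one tally per assembly level
        END { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' "$1" | LC_ALL=C sort            # stable, locale-independent order
}

# Usage (placeholder filename):
#   count_by_level assembly_summary_genbank.txt
```

Each output line is a level name, a tab, and the number of rows at that level, so the "complete" genomes can be read off directly instead of being mixed in with contig- or scaffold-level assemblies.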


Thanks - I found that number too using:

$ curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=genome"

Any idea why these don't match? Sorry if these are very basic questions, I'm extremely new to this and just starting grad school for Biomedical Informatics.

written 17 months ago by emartin.morgan0

Genomes are in various stages of completion, so they may be listed in different sections. The total number shown may include sequences, maps, chromosomes, assemblies, and annotations.

If you are only interested in complete genomes then you could parse the genome reports files as shown here: A: most sequenced genomes
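
A minimal sketch of that kind of parsing, for anyone following along: the function below counts the rows of a tab-separated report whose named column equals a given value. It assumes the first line is a header (optionally starting with `#`); the filename `prokaryotes.txt`, the column name `Status`, and the value `Complete Genome` in the usage line are assumptions about the genome reports files, so check the actual header first.

```shell
# Count rows in a tab-separated report where a named column equals a value.
# Assumption: line 1 is the header, possibly prefixed with '#'.
count_where() {   # usage: count_where FILE COLUMN VALUE
    awk -F '\t' -v col="$2" -v val="$3" '
        NR == 1 {
            sub(/^#/, "", $1)                      # drop a leading "#"
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            next
        }
        c && $c == val { n++ }                     # only if column was found
        END { print n + 0 }                        # prints 0 when nothing matched
    ' "$1"
}

# Usage (file, column and value names are assumptions):
#   count_where prokaryotes.txt Status "Complete Genome"
```

Looking the column up by name rather than by position keeps the sketch usable if the column order differs between report files.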

modified 17 months ago • written 17 months ago by genomax80k

Okay, I was thinking it was due to any that might be incomplete. Appreciate the help, but I'm not familiar with the language you're using. Is it Perl? Sorry I'm hopeless! I'm really just trying to understand how to work within the terminal using the E-utilities specifically.

modified 17 months ago • written 17 months ago by emartin.morgan0

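It is not Perl. The einfo and xtract commands are part of Entrez Direct (EDirect), a set of Unix command-line utilities made available by NCBI. You can read more about it here. You can also get similar functionality from the NCBI E-utilities web interface, which is what your curl commands are calling.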
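
If you would rather stay with curl, here is a small sketch of how the count could be pulled out of the XML that the E-utilities return. `first_count` is a hypothetical helper name; it assumes the value appears as `<Count>digits</Count>` in the response and simply takes the first such element it finds.

```shell
# Pull the first <Count> value out of E-utilities XML read from stdin.
# Assumption: the count appears as <Count>digits</Count>, as in the
# esearch (rettype=count) and einfo responses discussed in this thread.
first_count() {
    grep -o '<Count>[0-9]*</Count>' | head -n 1 | sed 's/[^0-9]//g'
}

# Usage (needs network access, so left commented out):
#   base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
#   curl -s "$base/esearch.fcgi?db=genome&term=overview&rettype=count" | first_count
#   curl -s "$base/einfo.fcgi?db=genome" | first_count
```

Running both usage lines side by side is a quick way to compare the number of records matching a search term with the figure einfo reports for the database as a whole.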

written 17 months ago by genomax80k