Question

Number of genomes available in NCBI genome db

5.5 years ago

emartin.morgan • 0

I am trying to find the total number of genomes in the NCBI Genome database using E-utilities. Browsing by organism on the website shows 39,625, and I'd like to pull that number with E-utilities. I ran the following in the Mac Terminal, but it returned only 6314 results:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=overview&rettype=count"

Any ideas how I can edit that code to return the full list of genome IDs?

ncbi genome e-utilities • 925 views

ADD COMMENT • link updated 5.5 years ago by GenoMax 141k • written 5.5 years ago by emartin.morgan • 0

score 1 · Answer 1 · 2018-10-25

5.5 years ago

GenoMax 141k

Take a look at the output of:

$ einfo -db genome 

$ einfo -db genome | xtract -pattern DbInfo -element Name -element TermCount
ALL UID FILT    ORGN    PID PRJA    PRJT    DFLN    DSCR    STAT    AID AACC    ANAM    GI  ACCN    RNAM    PACC    PROT    PGI GNID    GENE    LTAG    WGSP    PMID    BIOP    PCID    PROP    CDT STRN    HOST    genome_assembly genome_bioproject   genome_gene genome_nuccore  genome_nuccore_samespecies  genome_protein  genome_proteinclusters  genome_pubmed   genome_taxonomy 
5674172 0   18  393432  81030   81015   21  40408   27281   4   191288  191288  196002  395879  872215  10882   00  0   0   3027959 0   314634  28232   0   61  9   5807    161224  34

You can parse the following two files to get all sorts of information.
Assembly summary file for GenBank can be found here.
Similar file for RefSeq genomes is here.

NCBI's Genome reports: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/
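As a rough sketch of what parsing an assembly summary file could look like: the function below tallies assemblies per assembly level. It assumes the file is tab-separated, that comment and header lines start with "#", and that the level is in a column named assembly_level (falling back to column 12 if no header is found). The function name count_by_level and the file name in the usage line are illustrative, not something from this thread.

```shell
# Tally assemblies per assembly level from an assembly summary file on stdin.
# Assumptions: tab-separated input, comment/header lines start with "#",
# and the level sits in a column named "assembly_level" (default: column 12).
count_by_level() {
  awk -F '\t' '
    BEGIN { col = 12 }
    # Header/comment lines: locate the assembly_level column if it is named.
    /^#/ { for (i = 1; i <= NF; i++) if ($i == "assembly_level") col = i; next }
    # Data lines: count one assembly for the level found in that column.
    NF >= col { n[$col]++ }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' | sort
}
```

Usage, once a summary file has been downloaded: count_by_level < assembly_summary_genbank.txt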

ADD COMMENT • link 5.5 years ago by GenoMax 141k

Thanks - I found that number too using: curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=genome"

Any idea why these don't match? Sorry if these are very basic questions, I'm extremely new to this and just starting grad school for Biomedical Informatics.

ADD REPLY • link 5.5 years ago by emartin.morgan • 0

Genomes are in various stages of completion, so they may be listed in different sections. The total shown may include sequences, maps, chromosomes, assemblies, and annotations.

If you are only interested in complete genomes then you could parse the genome reports files as shown here: A: most sequenced genomes

ADD REPLY • link 5.5 years ago by GenoMax 141k

Okay, I was thinking it was due to any that might be incomplete. Appreciate the help, but I'm not familiar with that language you're using. Is it Perl? Sorry I'm hopeless! Really just trying to understand how to work within terminal using the specific e-utilities.

ADD REPLY • link 5.5 years ago by emartin.morgan • 0

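It is not Perl. einfo and xtract are command-line tools made available by NCBI as part of Entrez Direct (E-utilities on the UNIX command line). You can read more about it here. You can also get similar functionality through the NCBI E-utilities web interface, which is what your curl commands are calling.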
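For the web-interface route, here is a minimal sketch using only curl and grep. It assumes the EInfo and ESearch responses are XML containing a <Count> element, and that the first one is the total of interest. The helper names first_count, db_record_count, and search_count are made up for illustration.

```shell
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Print the value of the first <Count> element in XML read from stdin.
# <TermCount> elements are not matched because the pattern requires "<Count>".
first_count() {
  grep -o '<Count>[0-9]*</Count>' | head -n 1 | sed 's/<[^>]*>//g'
}

# Record count that EInfo reports for a database, e.g. db_record_count genome
db_record_count() {
  curl -s -G "$EUTILS/einfo.fcgi" --data-urlencode "db=$1" | first_count
}

# Number of records matching a query, e.g. search_count genome overview
search_count() {
  curl -s -G "$EUTILS/esearch.fcgi" \
    --data-urlencode "db=$1" \
    --data-urlencode "term=$2" \
    --data-urlencode "rettype=count" | first_count
}
```

Running search_count genome overview issues the same request as the curl command in the question and prints just the number.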

ADD REPLY • link 5.5 years ago by GenoMax 141k