Question: Number of genome sequences in NCBI
9 days ago by
bwczech60 wrote:


Could you tell me how I can check the number of organisms for which full genome sequences are available in NCBI? I would also like to check how many mammalian genomes are available in NCBI.

Thank you in advance.

ncbi genome • 190 views
modified 7 days ago by vkkodali950 • written 9 days ago by bwczech60
9 days ago by
United States
genomax62k wrote:

Parse the files to get "complete genomes" or any other criteria you are looking for.
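
A minimal sketch of that parsing step, assuming the files in question are NCBI assembly_summary-style tables: tab-separated, with comment lines starting with "#" and the assembly level (for example "Complete Genome") in column 12. The column position, the function name count_level, and the toy rows are assumptions for illustration only; check the header of the file you actually downloaded.

```shell
#!/bin/sh
# count_level FILE LEVEL: count rows whose 12th tab-separated field equals LEVEL.
# Assumption: column 12 holds the assembly level; verify against the file header.
count_level() {
    awk -F '\t' -v level="$2" '!/^#/ && $12 == level { n++ } END { print n + 0 }' "$1"
}

# Toy input with made-up rows, only to show the call.
demo=$(mktemp)
printf '# comment line\n' > "$demo"
printf 'A1\t.\t.\t.\t.\t9606\t9606\tHomo sapiens\t.\t.\tlatest\tComplete Genome\n' >> "$demo"
printf 'A2\t.\t.\t.\t.\t10090\t10090\tMus musculus\t.\t.\tlatest\tScaffold\n' >> "$demo"

complete=$(count_level "$demo" "Complete Genome")
echo "complete genomes: $complete"
rm -f "$demo"
```

On a real file the call would be count_level followed by the file name and whichever level string you want to count.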

written 9 days ago by genomax62k

Thank you, but how can I check the number of genomes available for mammals? In this file there is no field that I can use to filter for mammals...

written 9 days ago by bwczech60

Go to this page. Click to add a filter for mammals. Hit search. Looks like there are 282 at the moment (Feb 19).

modified 9 days ago • written 9 days ago by genomax62k

RefSeq assembly_summary.txt files for broad categories such as vertebrate_mammalian are present in corresponding directories in this path: For example, the assembly_summary.txt file for the vertebrate_mammalian is here:

written 9 days ago by vkkodali950
8 days ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

Use the 6th column to identify the taxonomy. Then, to process IDs by taxonomy, use taxonkit and csvtk.

Get the mammalian IDs:

taxonkit list --ids 40674 --indent "" | grep . > mammalian.ids.txt

Filter the assembly file:

cat assembly_summary_genbank.txt | cut -f 1,6,8 | csvtk -t grep -f 2 -P mammalian.ids.txt > mammalian.txt

This will give you 649 genomes, but not all of them have unique taxids. To count the unique taxids:

cat mammalian.txt | cut -f 2 | sort | uniq -c | sort -rn | wc -l
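
For readers without csvtk, the same filter-then-count idea can be sketched with awk alone. This is an alternative, not the pipeline above: it loads the ID list into a lookup table (skipping blank lines), keeps rows whose column 6 is in that table, and counts distinct taxids. The function name filter_by_taxid, the file names, and the toy rows are made up for the demonstration.

```shell
#!/bin/sh
# filter_by_taxid IDS SUMMARY: print columns 1, 6, 8 of rows whose taxid (column 6) is listed in IDS.
filter_by_taxid() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { if ($1 != "") ids[$1]; next }   # first file: load IDs, ignore blank lines
        !/^#/ && ($6 in ids) { print $1, $6, $8 }   # second file: keep matching rows
    ' "$1" "$2"
}

# Toy input with made-up accessions; the ID list deliberately contains a blank line.
work=$(mktemp -d)
printf '9606\n\n10090\n' > "$work/ids.txt"
printf 'A1\t.\t.\t.\t.\t9606\t.\tHomo sapiens\n' > "$work/summary.txt"
printf 'A2\t.\t.\t.\t.\t9606\t.\tHomo sapiens\n' >> "$work/summary.txt"
printf 'A3\t.\t.\t.\t.\t10090\t.\tMus musculus\n' >> "$work/summary.txt"
printf 'A4\t.\t.\t.\t.\t7227\t.\tDrosophila melanogaster\n' >> "$work/summary.txt"

rows=$(filter_by_taxid "$work/ids.txt" "$work/summary.txt" | wc -l | tr -d ' ')
taxa=$(filter_by_taxid "$work/ids.txt" "$work/summary.txt" | cut -f 2 | sort -u | wc -l | tr -d ' ')
echo "assemblies: $rows, unique taxids: $taxa"
rm -rf "$work"
```

The two counts correspond to the "genomes" and "unique taxids" figures discussed in this thread; on real data the ID list would come from taxonkit as shown above.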


written 8 days ago by Istvan Albert ♦♦ 79k

This is one of those code examples that initially looked easy. I knew what needed to be done, but it was a lot more frustrating to accomplish and warrants a bug report: taxonkit list adds an empty line to the file, which in turn matches everything in the grep step, so one also needs to filter out the empty lines with grep . first. A typical bioinformatics gotcha.
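
The gotcha can be reproduced with plain grep, used here as a stand-in for the csvtk step: an empty line in a pattern file is an empty pattern, and an empty pattern matches every line. The file names and IDs below are toy values.

```shell
#!/bin/sh
work=$(mktemp -d)
printf '9606\n\n' > "$work/ids.txt"              # one real ID followed by an empty line
printf '9606\n7227\n4932\n' > "$work/taxids.txt"

# The empty pattern matches every line, so all three lines come back.
unfiltered=$(grep -F -f "$work/ids.txt" "$work/taxids.txt" | wc -l | tr -d ' ')

# Dropping empty lines first (grep . keeps lines with at least one character) fixes it.
grep . "$work/ids.txt" > "$work/ids.clean.txt"
filtered=$(grep -F -f "$work/ids.clean.txt" "$work/taxids.txt" | wc -l | tr -d ' ')

echo "with empty line: $unfiltered, after cleaning: $filtered"
rm -rf "$work"
```

A count that equals the size of the whole input is a good hint that a blank pattern slipped in.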

written 8 days ago by Istvan Albert ♦♦ 79k
7 days ago by
United States
vkkodali950 wrote:

If all you are after is a count (and not particularly interested in downloading the assemblies), you can do this from the NCBI Assembly web portal. This way, you won't have to bother downloading and installing a couple of other programs on your machine if you don't want to.

  1. Search for the following in NCBI Assembly: mammals[Organism] AND latest_genbank[Properties]. This will return 633 assemblies. You can search for other broad categories such as 'rodents', 'plants', etc. as well.

  2. There may be multiple assemblies submitted for a single species, so you need to flatten this list to a unique set of taxa. You can do this by following the link to the Taxonomy page. In the right-hand panel of the Assembly results page you will notice a 'Find related data' facet with a drop-down list of databases beneath it. From that drop-down list, choose 'Taxonomy' and click the 'Find items' button. You will be directed to a Taxonomy results page. The count there, 295, is what you are looking for.

I am not entirely sure why there is a discrepancy of 2, compared to the taxonkit/csvtk method described above; I did not download the two programs and run them on my own.

written 7 days ago by vkkodali950

Curious as to why the genomes page I linked above has only 283. One more today than yesterday. It does not match what you/Istvan see.

modified 7 days ago • written 7 days ago by genomax62k

I have yet another one: 294

This is based on eukaryotes.txt in genome_reports instead of the assembly_reports (filtered on SubGroup for mammals and counting unique organism names)
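
A rough sketch of that count, assuming a tab-separated file whose first line is a header containing a column named SubGroup and whose first column is the organism name. The function name count_subgroup, the reduced header, the "Mammals" value, and the rows are toy assumptions; the real eukaryotes.txt has more columns, so the sketch looks the column up by name instead of hard-coding its position.

```shell
#!/bin/sh
# count_subgroup FILE NAME: count distinct first-column values among rows whose SubGroup equals NAME.
count_subgroup() {
    awk -F '\t' -v want="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SubGroup") col = i; next }
        col && $col == want { seen[$1] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    ' "$1"
}

# Toy input with a reduced, made-up header and rows.
demo=$(mktemp)
printf '#Organism/Name\tTaxID\tGroup\tSubGroup\n' > "$demo"
printf 'Homo sapiens\t9606\tAnimals\tMammals\n' >> "$demo"
printf 'Homo sapiens\t9606\tAnimals\tMammals\n' >> "$demo"
printf 'Mus musculus\t10090\tAnimals\tMammals\n' >> "$demo"
printf 'Danio rerio\t7955\tAnimals\tFishes\n' >> "$demo"

mammals=$(count_subgroup "$demo" "Mammals")
echo "unique mammal organism names: $mammals"
rm -f "$demo"
```

Counting by organism name rather than taxid is one plausible reason this number differs slightly from the other methods in the thread.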

Most likely the rates at which the files are refreshed are different.

modified 7 days ago • written 7 days ago by Carambakaracho760

The problem with NCBI interfaces is that one is never quite sure what they do behind the scenes. It is always "click this", "click that", and by the end there is no indication of whether one did it right. You end up with a number, not quite sure what happened along the way.

I passionately hate the NCBI data interfaces for the reasons I list above - more than any other factor it is the hare-brained data models and interfaces at NCBI that fuel confusion and lack of reproducibility

written 5 days ago by Istvan Albert ♦♦ 79k