Question

Number of genome sequences in NCBI

0

5.2 years ago

misterie ▴ 110

Hi,

Could you tell me how I can check the number of organisms for which full genome sequences are available in NCBI? I would also like to check how many mammalian genomes are available in NCBI.

Thank you in advance.

genome ncbi • 1.6k views

ADD COMMENT • link updated 5.2 years ago by vkkodali_ncbi ★ 3.7k • written 5.2 years ago by misterie ▴ 110

score 2 · Answer 1 · 2019-02-10

2

5.2 years ago

GenoMax 141k

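You can find a summary of all genomes in GenBank in the assembly_summary_genbank.txt file (https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt).
A similar file, assembly_summary_refseq.txt, is also available for RefSeq genomes.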

Parse the files to select "Complete Genome" entries, or filter on any other criteria you are looking for.
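
As a minimal sketch of that parsing step, the shell function below counts the rows of an assembly summary file at a given assembly level. It assumes the file is tab-separated and that its "#"-prefixed header row names a column called assembly_level; check the header of your download before relying on it. The function name count_by_level is made up for illustration.

```shell
# Count assemblies at a given level in an NCBI assembly_summary file.
# Assumption: tab-separated file whose "#"-prefixed header row names an
# "assembly_level" column. count_by_level is a hypothetical helper.
count_by_level() {
    # $1: path to the summary file, $2: assembly level to count
    awk -F '\t' -v level="$2" '
        /^#/ {
            # Locate the assembly_level column in the commented header row.
            for (i = 1; i <= NF; i++) if ($i == "assembly_level") col = i
            next
        }
        col && $col == level { n++ }
        END { print n + 0 }
    ' "$1"
}

# Usage (after downloading the file):
#   count_by_level assembly_summary_genbank.txt "Complete Genome"
```

Looking the column up by name rather than by position keeps the sketch working if columns are ever added or reordered.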

ADD COMMENT • link 5.2 years ago by GenoMax 141k

0

Thank you, but how can I check the number of genomes available for mammals? This file has no field that I can use to filter for mammals...

ADD REPLY • link 5.2 years ago by misterie ▴ 110

2

Go to this page. Click to add a filter for mammals. Hit search. Looks like there are 282 at the moment (Feb 19).

ADD REPLY • link 5.2 years ago by GenoMax 141k

1

RefSeq assembly_summary.txt files for broad categories such as vertebrate_mammalian are present in the corresponding directories under this path: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
For example, the assembly_summary.txt file for vertebrate_mammalian is here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
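
Once a category-level file is downloaded, both the number of assemblies and the number of distinct species can be read off directly. The sketch below assumes the commented header row names a species_taxid column; summarize_assemblies is a hypothetical helper name, not part of any NCBI tool.

```shell
# Summarize a category-level assembly_summary.txt: number of assemblies
# and number of distinct species.
# Assumption: the "#"-prefixed header row names a "species_taxid" column.
# summarize_assemblies is a hypothetical helper.
summarize_assemblies() {
    awk -F '\t' '
        /^#/ {
            # Locate the species_taxid column in the commented header row.
            for (i = 1; i <= NF; i++) if ($i == "species_taxid") col = i
            next
        }
        col {
            rows++
            if (!($col in seen)) { seen[$col] = 1; species++ }
        }
        END { printf "assemblies=%d species=%d\n", rows, species }
    ' "$1"
}

# Usage:
#   wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
#   summarize_assemblies assembly_summary.txt
```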

ADD REPLY • link 5.2 years ago by vkkodali_ncbi ★ 3.7k

score 1 · Answer 2 · 2019-02-11

1

5.2 years ago

Istvan Albert 100k

Use the 6th column (the taxid) to identify the taxonomy. Then, to select IDs by taxonomy, use taxonkit and csvtk:

taxonkit: https://github.com/shenwei356/taxonkit
csvtk: https://github.com/shenwei356/csvtk
publication: https://www.biorxiv.org/content/10.1101/513523v1

Get the mammalian IDs:

taxonkit list --ids 40674 --indent "" | grep . > mammalian.ids.txt

Filter the assembly file:

wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
cat assembly_summary_genbank.txt | cut -f 1,6,8 | csvtk -t grep -f 2 -P  mammalian.ids.txt > mammalian.txt

This will give you 649 genomes, but not all of them have unique taxids. To count the unique taxids:

cat mammalian.txt | cut -f 2 | sort | uniq -c | sort -rn | wc -l

produces:

ADD COMMENT • link 5.2 years ago by Istvan Albert 100k

0

This was one of those code examples that initially looked easy: I knew what needed to be done, but it was a lot more frustrating to accomplish, and it warrants a bug report. taxonkit list adds an empty line to its output, and that empty pattern then matches everything in the grep step, so one also needs to filter out the empty lines with "grep ." first. A typical bioinformatics gotcha.
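
One way to sidestep the empty-line problem is to match IDs exactly instead of as patterns. The sketch below uses plain awk as a stand-in for csvtk grep (a swapped-in technique, not the method used above): it loads the ID list while skipping blank lines, then keeps only rows whose chosen column equals one of the IDs. filter_by_ids is a hypothetical helper name.

```shell
# Keep rows of a tab-separated table whose given column exactly matches
# an ID from a list, ignoring blank lines in the list.
# filter_by_ids is a hypothetical helper; awk stands in for csvtk grep.
filter_by_ids() {
    # $1: ID list (one per line), $2: table, $3: column number to match
    awk -F '\t' -v col="$3" '
        FILENAME == ARGV[1] {
            # First file: remember every non-empty ID.
            if ($0 != "") ids[$0] = 1
            next
        }
        /^#/ { next }    # skip commented header lines in the table
        ($col in ids)    # print rows with an exact ID match only
    ' "$1" "$2"
}

# Usage, counting unique matching taxids (column 6 of the summary file):
#   filter_by_ids mammalian.ids.txt assembly_summary_genbank.txt 6 | cut -f 6 | sort -u | wc -l
```

Because the lookup is an exact string comparison, a stray empty line in the ID list can no longer match every row, and an ID such as 9606 does not accidentally match 96060.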

ADD REPLY • link 5.2 years ago by Istvan Albert 100k

score 1 · Answer 3 · 2019-02-11

1

5.2 years ago

vkkodali_ncbi ★ 3.7k

If all you are after is a count (and you are not particularly interested in downloading the assemblies), you can get it from the NCBI Assembly web portal. This way, you won't have to bother downloading and installing a couple of other programs on your machine if you don't want to.

Search for the following in NCBI Assembly: mammals[Organism] AND latest_genbank[Properties]. This will return 633 assemblies. You can search for other broad categories such as 'rodents', 'plants', etc. as well.
There may be multiple assemblies submitted for a single species, so you need to flatten this list to a unique set of taxa. You can do this by following the link to the Taxonomy page. On the right-hand side of the Assembly results page you will notice a 'Find related data' facet with a drop-down list of databases beneath it. From that drop-down list, choose 'Taxonomy' and click the 'Find items' button. You will be directed to a Taxonomy results page. The count there, 295, is what you are looking for.
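
For a scriptable record of the first step, the same Assembly query can be sent to NCBI's E-utilities esearch endpoint, which returns the hit count as XML. The sketch below only builds the request URL; assembly_count_url is a hypothetical helper, its URL encoding covers only spaces and square brackets, and it does not reproduce the Taxonomy flattening step. The count returned today will differ from the 633 quoted above.

```shell
# Build an E-utilities esearch URL that asks only for the hit count of an
# NCBI Assembly query. assembly_count_url is a hypothetical helper; the
# minimal encoding below handles only spaces and square brackets.
assembly_count_url() {
    # $1: the query exactly as typed into the Assembly search box
    term=$(printf '%s' "$1" | sed -e 's/ /+/g' -e 's/\[/%5B/g' -e 's/]/%5D/g')
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&rettype=count&term=%s\n' "$term"
}

# Usage (needs network access; the response is XML containing a Count element):
#   curl -s "$(assembly_count_url 'mammals[Organism] AND latest_genbank[Properties]')"
```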

I am not entirely sure why there is a discrepancy of 2, compared to the taxonkit/csvtk method described above; I did not download the two programs and run them on my own.

ADD COMMENT • link 5.2 years ago by vkkodali_ncbi ★ 3.7k

0

Curious as to why the genomes page I linked above has only 283. One more today than yesterday. It does not match what you/Istvan see.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

I have yet another one: 294

This is based on eukaryotes.txt in genome_reports instead of the assembly_reports (filtered on SubGroup for mammals and counting unique organism names)

Most likely the files are refreshed at different rates.
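
The eukaryotes.txt count described here can be sketched as follows. The function assumes the file is tab-separated, that the organism name is in the first column, and that the first line is a header naming a SubGroup column; count_subgroup_organisms is a hypothetical helper name.

```shell
# Count unique organism names for one SubGroup in eukaryotes.txt.
# Assumptions: tab-separated file, organism name in the first column,
# and a first-line header that names a "SubGroup" column.
# count_subgroup_organisms is a hypothetical helper.
count_subgroup_organisms() {
    # $1: path to eukaryotes.txt, $2: SubGroup value, e.g. Mammals
    awk -F '\t' -v group="$2" '
        NR == 1 {
            # Locate the SubGroup column from the header line.
            for (i = 1; i <= NF; i++) if ($i == "SubGroup") col = i
            next
        }
        col && $col == group && !($1 in seen) { seen[$1] = 1; n++ }
        END { print n + 0 }
    ' "$1"
}

# Usage:
#   count_subgroup_organisms eukaryotes.txt Mammals
```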

ADD REPLY • link 5.2 years ago by Carambakaracho ★ 3.2k

0

The problem with NCBI interfaces is that one is never quite sure what they do behind the scenes. It is always "click this", "click that", and by the end there is no indication of whether one did it right. You end up with a number, not quite sure what happened along the way.

I passionately hate the NCBI data interfaces for the reasons I list above. More than any other factor, it is the hare-brained data models and interfaces at NCBI that fuel confusion and lack of reproducibility.

ADD REPLY • link 5.2 years ago by Istvan Albert 100k