Question: Number of genome sequences in NCBI
0
6 months ago by
bwczech70
bwczech70 wrote:

Hi,

Could you tell me how I can check the number of organisms for which full genome sequences are available in NCBI? I would also like to check how many mammalian genomes are available in NCBI.

Thank you in advance.

ncbi genome • 356 views
modified 6 months ago by vkkodali1.1k • written 6 months ago by bwczech70
2
6 months ago by
genomax70k
United States
genomax70k wrote:

Parse the files to get "complete genomes" or any other criteria you are looking for.
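
A minimal sketch of that parsing step, assuming "the files" are the tab-separated assembly_summary tables on the NCBI FTP site (the ones discussed further down this thread), that version_status is column 11 and that assembly_level is column 12. The function name is made up for illustration.

```shell
# Count rows of an assembly_summary table (read from stdin) whose
# version_status (assumed column 11) is "latest" and whose
# assembly_level (assumed column 12) is "Complete Genome".
# Comment/header lines starting with "#" are skipped.
count_complete_genomes() {
    awk -F '\t' '
        /^#/ { next }
        $11 == "latest" && $12 == "Complete Genome" { n++ }
        END { print n + 0 }
    '
}
```

It could be called as count_complete_genomes < assembly_summary_genbank.txt; swapping the string in the comparison would select other assembly levels such as "Chromosome" or "Scaffold".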

written 6 months ago by genomax70k

Thank you, but how can I check the number of genomes available for mammals? There is no field in this file that I can use to filter for mammals...

written 6 months ago by bwczech70
2

Go to this page. Click to add a filter for mammals. Hit search. Looks like there are 282 at the moment (Feb 19).

modified 6 months ago • written 6 months ago by genomax70k
1

RefSeq assembly_summary.txt files for broad categories such as vertebrate_mammalian are present in the corresponding directories under this path: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ For example, the assembly_summary.txt file for vertebrate_mammalian is here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
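
As a rough sketch of how such a per-category file could be turned into counts: the function below reports the number of assemblies and the number of distinct species. It assumes species_taxid is column 7 of the table; the function name is hypothetical.

```shell
# Print "<assemblies> <distinct species>" for an assembly_summary table on stdin.
# species_taxid is assumed to be column 7; lines starting with "#" are skipped.
summarize_assemblies() {
    awk -F '\t' '
        /^#/ { next }
        { rows++; if (!($7 in seen)) { seen[$7] = 1; species++ } }
        END { print rows + 0, species + 0 }
    '
}

# Possible use (needs network access, so it is left commented out):
# curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt | summarize_assemblies
```

Counting on species_taxid rather than taxid is a design choice: it folds subspecies and strain-level entries into their parent species.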

written 6 months ago by vkkodali1.1k
1
6 months ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Use the 6th column to identify the taxonomy ID. Then, to process IDs by taxonomy, use taxonkit and csvtk.

Get the mammalian IDs (40674 is the NCBI taxonomy ID for Mammalia):

taxonkit list --ids 40674 --indent "" | grep . > mammalian.ids.txt

filter the assembly file

wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
cat assembly_summary_genbank.txt | cut -f 1,6,8 | csvtk -t grep -f 2 -P  mammalian.ids.txt > mammalian.txt

This will give you 649 genomes, but not all of them have unique taxids:

cat mammalian.txt | cut -f 2 | sort | uniq -c | sort -rn | wc -l

produces:

 297
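
For anyone who would rather not install csvtk, here is a sketch of the same filter-and-count idea using only awk. This swaps in exact matching on column 6 for the csvtk grep call, so it is an alternative rather than the method above; the function names are made up, and the column positions (1 accession, 6 taxid, 8 organism name) are assumed from the cut call in the pipeline.

```shell
# Keep assembly rows whose taxid (assumed column 6) appears in an ID list.
# $1 = file with one taxid per line (blank lines are ignored);
# the assembly table is read from stdin.
# Prints accession, taxid and organism name (columns 1, 6, 8).
filter_by_taxid() {
    awk -F '\t' -v idfile="$1" '
        BEGIN { while ((getline line < idfile) > 0) if (line != "") ids[line] = 1 }
        /^#/ { next }
        ($6 in ids) { print $1 "\t" $6 "\t" $8 }
    '
}

# Count distinct values in column 2 (the taxid) of the filtered output.
count_unique_taxids() {
    awk -F '\t' '!seen[$2]++ { n++ } END { print n + 0 }'
}
```

Usage would look like filter_by_taxid mammalian.ids.txt < assembly_summary_genbank.txt > mammalian.txt followed by count_unique_taxids < mammalian.txt. Because blank lines in the ID list are skipped on load, the empty-line issue does not arise here.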
written 6 months ago by Istvan Albert ♦♦ 81k

This was one of those code examples that initially looked easy. I knew what needed to be done, but it was a lot more frustrating to accomplish and warrants a bug report: taxonkit list adds an empty line to the file, which in turn matches everything in grep, so one also needs to filter out the empty lines with "grep ." A typical bioinformatics gotcha.
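
The effect can be reproduced with plain grep on toy data. This is only an analogous illustration, assuming csvtk's pattern-file matching behaves like grep -f in this respect; the file names and taxids are made up.

```shell
# Illustration with plain grep (not csvtk): an empty line in a pattern
# file is an empty pattern, and an empty pattern matches every line.
demo=$(mktemp -d)
printf '9606\n9913\n10090\n' > "$demo/taxids.txt"   # three data lines
printf '9606\n\n' > "$demo/ids_raw.txt"             # one id plus a blank line

# Drop blank lines from the pattern file, as "grep ." does in the pipeline.
grep . "$demo/ids_raw.txt" > "$demo/ids_clean.txt"

# Count data lines matched by each version of the pattern file.
with_blank=$(grep -c -F -f "$demo/ids_raw.txt" "$demo/taxids.txt")
without_blank=$(grep -c -F -f "$demo/ids_clean.txt" "$demo/taxids.txt")
echo "with blank line: $with_blank, without: $without_blank"
rm -r "$demo"
```

With the blank line in place every data line is counted as a match; once it is filtered out, only the line for the listed ID is.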

written 6 months ago by Istvan Albert ♦♦ 81k
1
6 months ago by
vkkodali1.1k
United States
vkkodali1.1k wrote:

If all you are after is a count (and you are not particularly interested in downloading the assemblies), you can get it from the NCBI Assembly web portal. This way, you won't have to bother downloading and installing a couple of other programs on your machine if you don't want to.

  1. Search for the following in NCBI Assembly: mammals[Organism] AND latest_genbank[Properties]. This will return 633 assemblies. You can search for other broad categories such as 'rodents', 'plants', etc. as well.

  2. There may be multiple assemblies submitted for a single species, so you need to flatten this list to a unique set of taxa. You can do this by following the link to the Taxonomy page. In the right-hand panel of the Assembly results page you will notice a 'Find related data' facet with a drop-down list of databases beneath it. From that drop-down list, choose 'Taxonomy' and click the 'Find items' button. You will be directed to a Taxonomy results page. The count there, 295, is what you are looking for.
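
For a scripted version of step 1, a sketch along these lines might work. It assumes the NCBI E-utilities esearch endpoint with db=assembly and rettype=count, which is not part of the web workflow above; the helper names are hypothetical, and the search term has to be URL-encoded by hand.

```shell
# Build an esearch URL that asks only for the hit count.
# $1 = database, $2 = search term, already URL-encoded
# ("[" -> %5B, "]" -> %5D, space -> +).
esearch_count_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=%s&term=%s&rettype=count\n' "$1" "$2"
}

# Pull the first <Count> value out of an esearch XML reply read from stdin.
extract_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# Possible use (needs network access, so it is left commented out):
# curl -s "$(esearch_count_url assembly 'mammals%5BOrganism%5D+AND+latest_genbank%5BProperties%5D')" | extract_count
```

This only covers the assembly count from step 1. Flattening to unique taxa, as in step 2, would still need the Taxonomy step or a separate query.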

I am not entirely sure why there is a discrepancy of 2, compared to the taxonkit/csvtk method described above; I did not download the two programs and run them on my own.

written 6 months ago by vkkodali1.1k

Curious as to why the genomes page I linked above has only 283. One more today than yesterday. It does not match what you/Istvan see.

modified 6 months ago • written 6 months ago by genomax70k

I have yet another one: 294

This is based on eukaryotes.txt in genome_reports instead of the assembly_reports (filtered on SubGroup for mammals and counting unique organism names).

Most likely the rates at which the files are refreshed are different.
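
A sketch of how that count could be reproduced, assuming eukaryotes.txt is tab-separated, has a header row containing a column literally named SubGroup, keeps the organism name in the first column, and labels mammals as "Mammals". The function name is made up.

```shell
# Count distinct organism names (column 1) among rows whose SubGroup
# column equals the requested value. The SubGroup column is located by
# name in the header row, so its position does not need to be known.
# $1 = SubGroup value to keep; the table is read from stdin.
count_unique_organisms() {
    awk -F '\t' -v want="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SubGroup") col = i; next }
        col && $col == want { seen[$1] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    '
}
```

It could be run as count_unique_organisms Mammals < eukaryotes.txt. Counting by organism name rather than taxid is one more reason the number may differ slightly from the other approaches in this thread.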

modified 6 months ago • written 6 months ago by Carambakaracho1.5k

The problem with NCBI interfaces is that one is never quite sure what they do behind the scenes. It is always "click this", "click that", and by the end there is no indication of whether one did it right: you end up with a number, not quite sure what happened along the way.

I passionately hate the NCBI data interfaces for the reasons I list above. More than any other factor, it is the hare-brained data models and interfaces at NCBI that fuel confusion and lack of reproducibility.

written 6 months ago by Istvan Albert ♦♦ 81k