Use the 6th column of the assembly summary file (the taxid) to identify the taxonomy. Then, to process IDs by taxonomy, use the following tools:
- taxonkit: https://github.com/shenwei356/taxonkit
- csvtk: https://github.com/shenwei356/csvtk
- publication: https://www.biorxiv.org/content/10.1101/513523v1
Get the mammalian IDs (40674 is the taxid for Mammalia):
taxonkit list --ids 40674 --indent "" | grep . > mammalian.ids.txt
Download and filter the assembly file:
wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
cat assembly_summary_genbank.txt | cut -f 1,6,8 | csvtk -t grep -f 2 -P mammalian.ids.txt > mammalian.txt
This will give you 649 genomes, but not all of them have unique taxids. To count the unique taxids:
cat mammalian.txt | cut -f 2 | sort | uniq -c | sort -rn | wc -l
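If csvtk is not installed, the filter and the unique-taxid count can be approximated with standard tools. The following is only a minimal sketch, assuming a tab-separated summary file with the taxid in column 6 and an ID file with one taxid per line (as written by the taxonkit command above). The function names `filter_by_taxid` and `count_unique_taxids` are hypothetical helpers, not part of taxonkit or csvtk.

```shell
# Hypothetical helpers; assume tab-separated input with the taxid in column 6.

# filter_by_taxid IDS_FILE SUMMARY_FILE
# Prints columns 1, 6 and 8 of every summary row whose taxid is listed
# in IDS_FILE. Comment lines starting with '#' are skipped.
filter_by_taxid() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR   { ids[$1] = 1; next }   # first file: load the taxid list
        /^#/        { next }                # second file: skip header comments
        ($6 in ids) { print $1, $6, $8 }    # keep rows with a listed taxid
    ' "$1" "$2"
}

# count_unique_taxids FILTERED_FILE
# Counts distinct values in column 2 (the taxid) of the filtered file.
count_unique_taxids() {
    cut -f 2 "$1" | sort -u | wc -l | tr -d ' '
}

# Example usage (file names as in the commands above):
#   filter_by_taxid mammalian.ids.txt assembly_summary_genbank.txt > mammalian.txt
#   count_unique_taxids mammalian.txt
```

Here `sort -u` replaces the `sort | uniq -c | sort -rn` chain, since only the number of distinct taxids is needed, not the per-taxid assembly counts.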
If all you are after is a count (and you are not particularly interested in downloading the assemblies), you can get it from the NCBI Assembly web portal. This way, you do not have to download and install any additional programs on your machine.
Search for the following in NCBI Assembly:
mammals[Organism] AND latest_genbank[Properties]
This will return 633 assemblies. You can search for other broad categories, such as 'rodents', 'plants', etc., as well.
There may be multiple assemblies submitted for a single species, so you need to flatten this list to a unique set of taxa. You can do this by following the link to the Taxonomy page. In the right-hand panel of the Assembly results page you will notice a 'Find related data' facet with a drop-down list of databases beneath it. From that drop-down list, choose 'Taxonomy' and click the 'Find items' button. You will be directed to a Taxonomy results page. The count there, 295, is what you are looking for.
I am not entirely sure why there is a discrepancy of 2 compared to the taxonkit/csvtk method described above; I did not download and run the two programs myself.