NCBI CLI Download all proteins from Taxid
4 weeks ago
dthorbur ★ 1.9k

Among other taxonomic groups, I want to download all Hemiptera proteins from NCBI using the CLI tool ncbi-datasets-cli v16.10.1, installed with conda v23.5.0.

I've tried the following command, but I get an error.

datasets download gene taxon 7524
Error: The taxonomy ID '7524' is valid for Hemiptera, but the command 'gene download by taxon' requires an at-or-below-species taxon

Alternatively, I can use the genome function over gene:

datasets download genome taxon 7524 --include protein

Whilst this works, it only downloads proteins associated with genome assemblies: ~930,000, rather than the ~1,400,000 listed in NCBI Protein.
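To check a figure like the ~930,000 above, the per-assembly protein files can be pooled and counted. This is only a sketch: it assumes the archive has already been unzipped and that each assembly directory holds a file named protein.faa. The function name collect_genome_proteins is made up for illustration.

```shell
# Pool every protein.faa found under an unpacked datasets archive
# and print the number of FASTA records (header lines start with '>').
# Usage: collect_genome_proteins <unpacked_dir> <output.faa>
collect_genome_proteins() {
    dir="$1"
    out="$2"
    # Assumption: one protein.faa per assembly directory
    find "$dir" -name 'protein.faa' -exec cat {} + > "$out"
    grep -c '^>' "$out"
}
```

For example, `collect_genome_proteins ncbi_dataset_unzipped all_proteins.faa`. The count is of records, not unique sequences, so a protein present in several assemblies would be counted more than once.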

I want to see if there is a significant difference in clustering and redundancy removal with MMseqs when constructing a database for these two similar datasets. I realise most of the additional proteins will be alleles of annotated genes. This is just a test dataset for a later larger project.

Regardless, is there a way to download all proteins for a taxon from NCBI using a CLI tool?

ncbi • 177 views
4 weeks ago
GenoMax 142k

You can use Entrez Direct (EDirect) as one option. This should fetch 1466558 sequences as of today.

$ esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa
>sp|A0A7D0AGU9.1|TPS_MATON RecName: Full=Terpene synthase; Short=EoTPS
MEGLVNNSGDKDLDEKLLQPFTYILQVPGKQIRAKLAHAFNYWLKIPNDKLNIVGEIIQMLHNSSLLIDD
IQDNSILRRGIPVAHSIYGVASTINAANYVIFLAVEKVLRLEHPEATRVCIDQLLELHRGQGIEIYWRDN
FQCPSEDEYKLMTIRKTGGLFMLAIRLMQLFSESDADFTKLAGILGLYFQIRDDYCNLCLQEYSENKSFC
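
A possible variation, offered as a sketch rather than a tested recipe: the free-text query "hemiptera" can be replaced with a taxonomy-restricted one using the Entrez `txid<ID>[Organism:exp]` syntax, and the finished file can be checked against the expected record count. The function names below are hypothetical, and the count returned by the two queries may differ.

```shell
# Fetch all protein records at or below a taxid as FASTA on stdout.
# Assumes EDirect (esearch, efetch) is on PATH.
fetch_proteins_for_taxid() {
    esearch -db protein -query "txid${1}[Organism:exp]" | efetch -format fasta
}

# Count FASTA records in a file (header lines start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `fetch_proteins_for_taxid 7524 > file.fa` followed by `count_fasta_records file.fa`. A count well below the number reported on the NCBI website would suggest the download was interrupted.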

Alternatively, you could get the species-level taxIDs using a utility script included in the BLAST+ distribution, which would then allow you to use datasets.

$ get_species_taxids.sh -t 7524 > taxidlist
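
The resulting list could then be fed to datasets one species at a time. The following is a minimal sketch, assuming taxidlist holds one taxid per line and that `datasets download gene taxon` accepts `--include protein` and `--filename` in the installed version. The function name is hypothetical.

```shell
# Run one gene download per species-level taxid listed in a file.
# Usage: download_proteins_by_taxid <taxidlist> <outdir>
download_proteins_by_taxid() {
    list="$1"
    outdir="$2"
    mkdir -p "$outdir"
    while IFS= read -r taxid; do
        # Skip blank lines
        [ -n "$taxid" ] || continue
        # stdin is redirected so the command cannot consume the taxid list;
        # a failure (e.g. a species without gene records) is reported and skipped
        datasets download gene taxon "$taxid" --include protein \
            --filename "$outdir/$taxid.zip" < /dev/null \
            || echo "no gene data for taxid $taxid" >&2
    done < "$list"
}
```

For example, `download_proteins_by_taxid taxidlist hemiptera_genes`. This only retrieves proteins linked to NCBI Gene records, so the total may still fall short of the EDirect route, and for a large order it means many separate requests.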