Question

NCBI CLI Download all proteins from Taxid

0


8 weeks ago

dthorbur ★ 2.0k

Among other taxonomic groups, I want to download all Hemiptera proteins from NCBI using the CLI tool ncbi-datasets-cli v16.10.1, installed with conda v23.5.0.

I've tried using the following command, but get an error.

datasets download gene taxon 7524

Error: The taxonomy ID '7524' is valid for Hemiptera, but the command 'gene download by taxon' requires an at-or-below-species taxon

Alternatively, I can use the genome subcommand instead of gene:

datasets download genome taxon 7524 --include protein

Whilst this works, it downloads only proteins associated with genome assemblies: ~930,000, rather than the ~1,400,000 listed in NCBI Protein.

I want to see if there is a significant difference in clustering and redundancy removal with MMseqs when constructing a database for these two similar datasets. I realise most of the additional proteins will be alleles of annotated genes. This is just a test dataset for a later larger project.

Regardless, is there a way to download all proteins from NCBI using a CLI tool?

ncbi • 337 views

updated 2 days ago by MirianT_NCBI ▴ 730 • written 8 weeks ago by dthorbur ★ 2.0k

1


2 days ago

MirianT_NCBI ▴ 730

Hello,
I did some testing with the NCBI Datasets CLI (both the gene and genome endpoints) and e-utils, and wanted to share some thoughts. The best approach will depend on the questions you are trying to answer and the data you need. :) I used @Genomax's approach to get the taxids and also to download the protein sequences using e-utils. Here's the summary:

datasets gene:

It returns information for the 17 reference genomes annotated by NCBI's RefSeq annotation pipeline, plus mitochondrial proteins annotated as part of the NCBI Organelle RefSeq Project. It took around 4 hours to download everything while iterating over the list of Hemiptera taxids.

# Get species-level taxids (get_species_taxids.sh ships with BLAST+)
get_species_taxids.sh -t 7524 > 7524-taxid.list

# Get number of taxids
wc -l 7524-taxid.list 
49847 7524-taxid.list

# download protein sequences from all taxids

while read -r TAXID; do datasets download gene taxon "$TAXID" --filename "$TAXID.zip"; done < 7524-taxid.list

873 data packages downloaded

# Count number of proteins (assumes each .zip was unzipped into its own directory):
cat */ncbi_dataset/data/protein.faa > all_hemiptera_proteins.faa
grep -c ">" all_hemiptera_proteins.faa
464,271 proteins
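
The download loop above leaves one zip archive per taxid, so the archives have to be read before the proteins can be concatenated and counted. A minimal sketch of that step, assuming `unzip` is installed and that each gene package stores its proteins at `ncbi_dataset/data/protein.faa` (the path used above). The function names are illustrative and not part of any NCBI tool.

```shell
# Count FASTA records: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1" || true   # grep exits 1 when the count is 0
}

# Stream protein.faa out of every gene data package in a directory.
# Assumption: the in-archive path is ncbi_dataset/data/protein.faa.
merge_gene_package_proteins() {
    local dir="$1" out="$2" zip
    : > "$out"                              # start from an empty output file
    for zip in "$dir"/*.zip; do
        [ -e "$zip" ] || continue           # no archives matched the glob
        # Packages without a protein file are skipped silently
        unzip -p "$zip" 'ncbi_dataset/data/protein.faa' >> "$out" 2>/dev/null || true
    done
}
```

Usage would look like `merge_gene_package_proteins . all_hemiptera_proteins.faa` followed by `count_fasta_records all_hemiptera_proteins.faa`. Anchoring the pattern with `^>` avoids counting a `>` that appears inside a header line.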

datasets genome:

This command downloads protein sequences from all assembled genomes annotated by either NCBI's RefSeq annotation pipeline (GCF accessions) or annotations submitted to GenBank (GCA accessions). It downloaded everything in less than a minute.

# download protein sequences using the genome endpoint

datasets download genome taxon 7524 --include protein --filename 7524-genome-protein.zip

# Count number of proteins (archive unzipped into 7524-genome-protein/)

cat 7524-genome-protein/ncbi_dataset/data/*/protein.faa | grep -c ">"

969,059 (22 GCF and 17 GCA annotated genomes)
    551,399 (22 GCF)
    417,660 (17 GCA)
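
The GCF/GCA split above can be reproduced from the unzipped package by grouping on the accession prefix of each assembly directory. A rough sketch, assuming the layout `ncbi_dataset/data/<accession>/protein.faa` implied by the `cat` command above. The function name is hypothetical.

```shell
# Print protein counts for RefSeq (GCF_) and GenBank (GCA_) assemblies.
# $1 is the data directory holding one sub-directory per assembly accession.
count_proteins_by_source() {
    local data_dir="$1" faa acc n gcf=0 gca=0
    for faa in "$data_dir"/*/protein.faa; do
        [ -e "$faa" ] || continue                 # glob matched nothing
        acc=$(basename "$(dirname "$faa")")       # e.g. GCF_... or GCA_...
        n=$(grep -c '^>' "$faa" || true)          # FASTA headers in this file
        case "$acc" in
            GCF_*) gcf=$((gcf + n)) ;;
            GCA_*) gca=$((gca + n)) ;;
        esac
    done
    printf 'GCF\t%d\nGCA\t%d\ntotal\t%d\n' "$gcf" "$gca" "$((gcf + gca))"
}
```

It would be called as `count_proteins_by_source 7524-genome-protein/ncbi_dataset/data`. Directories whose names start with neither prefix are ignored.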

e-utils:

time esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa 
grep -c ">" file.fa                                          
1,511,837

There are a few things I want to point out regarding e-utils:

This search returns sequences that are not part of Hemiptera. If you look at the top left corner of the web results, you can see the number of results for plants, bacteria, and fungi. The reason is that this was a string search and not a taxonomic one. You can restrict the results to the desired taxonomy both on the web (using the advanced search option) and in e-utils (by adding the flag -organism Hemiptera).
A lot of the sequences returned are partial, in contrast to the results obtained using datasets.
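
One way to make the e-utils search taxonomic rather than textual is to query by taxid. A minimal sketch, assuming EDirect (`esearch`, `efetch`) is installed. `txid7524[Organism:exp]` is standard Entrez query syntax for a taxon and all of its descendants. The function names are hypothetical, and the resulting sequence count is not reported here.

```shell
# Build an Entrez query that matches a taxid and all of its descendants.
taxon_query() {
    printf 'txid%s[Organism:exp]' "$1"
}

# Fetch protein FASTA for a whole clade; requires EDirect on the PATH.
fetch_clade_proteins() {
    local taxid="$1" out="$2"
    esearch -db protein -query "$(taxon_query "$taxid")" |
        efetch -format fasta > "$out"
}
```

For Hemiptera this would be `fetch_clade_proteins 7524 hemiptera_proteins.fa`. Partial sequences are still included, so this only addresses the first point above.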

Let me know if you have any questions or if there's anything we can do to help you.


score 3 · Accepted Answer · 2024-04-02

One option is Entrez Direct (EDirect). As of today, this should fetch 1,466,558 sequences.

$ esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa
>sp|A0A7D0AGU9.1|TPS_MATON RecName: Full=Terpene synthase; Short=EoTPS
MEGLVNNSGDKDLDEKLLQPFTYILQVPGKQIRAKLAHAFNYWLKIPNDKLNIVGEIIQMLHNSSLLIDD
IQDNSILRRGIPVAHSIYGVASTINAANYVIFLAVEKVLRLEHPEATRVCIDQLLELHRGQGIEIYWRDN
FQCPSEDEYKLMTIRKTGGLFMLAIRLMQLFSESDADFTKLAGILGLYFQIRDDYCNLCLQEYSENKSFC

Or you could get the species-level taxIDs using a utility program included in the BLAST+ distribution, which would then allow you to use datasets.

$ get_species_taxids.sh -t 7524 > taxidlist
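
Feeding that list to datasets could look like the sketch below. It assumes `datasets` exits with a non-zero status when a taxid has no gene data, which is likely for most taxids in a large clade. The function name and the `failed_taxids.txt` log file are illustrative only.

```shell
# Download one gene data package per species-level taxid.
# Assumption: `datasets` returns non-zero when a taxid has no gene records;
# those taxids are logged and the loop moves on.
download_gene_packages() {
    local list="$1" outdir="$2" taxid ok=0 failed=0
    mkdir -p "$outdir"
    while read -r taxid; do
        [ -n "$taxid" ] || continue               # skip blank lines
        # stdin is redirected so the command cannot consume the taxid list
        if datasets download gene taxon "$taxid" \
               --filename "$outdir/$taxid.zip" < /dev/null; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
            echo "$taxid" >> "$outdir/failed_taxids.txt"
        fi
    done < "$list"
    echo "downloaded=$ok failed=$failed"
}
```

A call such as `download_gene_packages taxidlist gene_packages` keeps the archives in one directory and leaves a record of the taxids that returned nothing.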