Unable to download bacterial protein sequences from NCBI using datasets CLI (DOMAIN is not a valid V2reportsRankType error)
12 days ago
Rohan ▴ 40

Hi everyone,

I’m trying to download bacterial protein sequences from NCBI using the datasets command-line tool, but I keep getting an error I don’t understand.

Here are the commands I’ve tried:

datasets download genome taxon 2 --include protein --dehydrated --filename bacteria.zip

and

datasets download genome taxon bacteria --include protein --dehydrated --filename bacteria.zip

Both return the same error message:

Error: DOMAIN is not a valid V2reportsRankType
Use datasets download genome taxon <command> --help for detailed help about a command.

According to the documentation, the taxon argument should accept either a taxon ID or a name, so I’m not sure why this fails.

Has anyone encountered this DOMAIN is not a valid V2reportsRankType error when using datasets?

What’s the correct way to download all bacterial protein sequences (or genomes including proteins) using the latest NCBI Datasets CLI?

Any help or clarification would be appreciated!

prokaryotes datasets ncbi • 492 views
12 days ago
GenoMax 154k

"taxon argument should accept either a taxon ID or a name"

In my experience, using bacteria as the taxon does not work. There are more than 2.8 million bacterial genomes at this time, so it would be a very large download in any case.

If you go one level below Bacteria in the classification, the command does seem to work. For example, the taxonomy browser lists the following sections under Bacteria: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?searchTerm=2&searchMode=complete+name&lock=1&unlock=1&command=search Choosing one of those sections is valid.

$ ./datasets download genome taxon Bacillati --include protein --dehydrated --filename bacteria.zip

Collecting 985,291 genome records [------------------------------------------------]   0% 2000/985291

$ ./datasets download genome taxon pseudomonadati --include protein --dehydrated --filename bacteria.zip

Collecting 2,302,066 genome records [------------------------------------------------]   0% 2000/2302066
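
Both commands above write to bacteria.zip, so running them one after the other would overwrite the first package. Below is a minimal sketch of looping over the sections with one output file each, followed by the unzip and rehydrate steps that a --dehydrated package needs before it contains any sequence files. The DATASETS and DRY_RUN variables and the run and fetch_taxon helpers are hypothetical names made up for this example. The taxon list holds only the two names shown in this thread, and the rehydrate call reflects my understanding of the CLI, so check datasets rehydrate --help on your version. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: one dehydrated protein package per section below Bacteria.
# DATASETS and DRY_RUN are hypothetical helper variables, not CLI settings.
DATASETS="${DATASETS:-datasets}"   # path to the datasets binary
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands only; 0 = run them

# Only the two names used in this thread; add the other sections
# listed in the taxonomy browser as needed.
TAXA="Bacillati Pseudomonadati"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = taxon name; writes <taxon>.zip and unpacks it into <taxon>/
fetch_taxon() {
    run "$DATASETS" download genome taxon "$1" --include protein --dehydrated --filename "$1.zip" &&
    run unzip -q "$1.zip" -d "$1" &&
    # Assumption: rehydrate fetches the files referenced by the package.
    run "$DATASETS" rehydrate --directory "$1"
}

for taxon in $TAXA; do
    fetch_taxon "$taxon" || exit 1
done
```

Run it as is to review the commands, then set DRY_RUN=0 to execute them. Given the record counts shown above, expect each real run to take a long time and a lot of disk space.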

As a second option, you can use the web-based NCBI Datasets page.

It may be better to download only the "reference" subset of 21,444 genomes from there.

1. Go to: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=2&reference_only=true
2. Click the checkbox at the top left of the table to select all genomes.
3. Click Download and then select Download Package. It will take some time for even the 21.4K genomes to be selected.
4. Once that is done, the options to check will become available in the menu. Select Protein fasta. There will again be a delay while this filter is applied.
5. Be sure to uncheck the Genome Sequence box if you don't want the nucleotide sequence.

The download will then start. It may take quite some time and disk space to complete.
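
If you would rather stay on the command line, a rough equivalent of the reference_only=true filter is the --reference flag of datasets download genome taxon. The sketch below assumes that flag is available in your CLI version (check --help) and that it can be combined with the per-section taxon names from earlier in this answer. I have not verified that the per-section results add up to the same 21,444 genomes as the web page. DATASETS, DRY_RUN and reference_cmd are hypothetical names for this example, and by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: reference genomes per section, protein FASTA only.
# DATASETS and DRY_RUN are hypothetical helper variables, not CLI settings.
DATASETS="${DATASETS:-datasets}"   # path to the datasets binary
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands only; 0 = run them

# $1 = taxon name; writes <taxon>_reference.zip
reference_cmd() {
    # Assumption: --reference limits the request to reference genomes.
    set -- "$DATASETS" download genome taxon "$1" --reference --include protein --filename "${1}_reference.zip"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Only the two section names used in this thread.
for taxon in Bacillati Pseudomonadati; do
    reference_cmd "$taxon"
done
```

Leaving out --include genome mirrors unchecking the Genome Sequence box in the web steps: only protein FASTA is requested.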
