14 months ago
K.Gee ▴ 40

Hello, biostars,

I want to download the accession numbers of all bacterial proteins. From https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria I used Send to --> File --> Format (Accession List) --> Create File, but this does not seem to work for bacteria (I tested it with viruses and archaea, and it works perfectly). After that, I tried to extract the accession list from the command line, but I could not do so. Even the command NCBI proposes for genomes does not seem to work: the "command-line tool" option (https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria) gives

datasets download genome taxon 2 --filename bacteria.zip


I got this error: unknown flag: --filename

I also tried changing parts of the command, such as genome to gene (datasets download gene taxon 2 --filename bacteria.zip), but that downloads the gene with ID 2 (it parses the term taxon). I also tried curl 'ftp://ftp.ncbi.nlm.nih.gov/protein/?term=bacteria%5BAll+Fields%5D'

Does anybody have an idea how to solve this issue?

Thanks for the response. I will use the script if I need to download the respective seqs. Again, thanks a lot for the script :D

AFAIK datasets is only meant to work with genome-level data. Doing

./datasets download genome taxon 2


will get you information about bacterial genome accessions. You can use

--reference         limit to reference and representative (GCF_ and GCA_) assemblies
--refseq            limit to RefSeq (GCF_) assemblies

Thanks again for the response. I knew that it works at the genome level, but I saw a gene option, so my idea was to download all the genes and afterwards extract the accession numbers... I know that my idea was a bit stupid and complicated :P

14 months ago
Sej Modha 5.1k

You could use NCBI's command-line E-utilities (EDirect) instead.

esearch -db protein -query 'txid2 [Orgn]'|efetch -format acc > txid2_protein_acc.txt

A lot of records are going to be WP* accessions which point to multiple organisms. Something to keep in mind.
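To see how many of the accessions are WP_ records, something like the following might help. It is only a sketch: the function name split_wp_accessions and the output file names are made up for illustration, and it assumes one accession per line, as in txid2_protein_acc.txt from the command above.

```shell
# Hypothetical helper: split an accession list into WP_ records and the rest.
# Usage: split_wp_accessions INPUT_FILE OUTPUT_PREFIX
split_wp_accessions() {
    input="$1"
    prefix="$2"
    # WP_ accessions are non-redundant RefSeq proteins that can map to several organisms.
    # "|| true" keeps going when grep finds no matching lines (grep exits 1 in that case).
    grep '^WP_' "$input" > "${prefix}_wp.txt" || true
    grep -v '^WP_' "$input" > "${prefix}_other.txt" || true
    # Report how many accessions landed in each file.
    echo "WP_ accessions: $(wc -l < "${prefix}_wp.txt" | tr -d ' ')"
    echo "other accessions: $(wc -l < "${prefix}_other.txt" | tr -d ' ')"
}

# Example call (assumes the file from the esearch/efetch command exists):
# split_wp_accessions txid2_protein_acc.txt txid2
```

The split is by prefix only, so it says nothing about which organisms a given WP_ record belongs to.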

Thanks a lot for the tip :-) !!!

Super thank you! Works exactly as I want!!!

14 months ago
GenoMax 117k

If you have access to the nr BLAST database, then use blastdbcmd, which is part of the BLAST+ package.

blastdbcmd -db nr -taxids 2 -outfmt %a


If your next question is going to be about creating a FASTA subset of these sequences, then use

blastdbcmd -db nr -taxids 2 -outfmt %f > bacteria.fa

This command looks very, very interesting; however, if I understand your response well:

Did you mean locally? I'm asking because the command doesn't accept the term "taxids"

Correct. You will need to have the nr BLAST indexes downloaded locally, along with the taxonomy files. Make sure you have the latest BLAST+ installed.
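If the command does not accept -taxids, a quick version check along these lines may help. This is a rough sketch, not an official recipe: version_ge is a hypothetical helper, and the minimum version 2.8.1 is my assumption for when -taxids support arrived with the version 5 databases, so check the BLAST+ release notes for your setup.

```shell
# Hypothetical helper: succeed (exit 0) if dotted version $1 is >= $2.
# Compares the first three numeric fields; a trailing "+" (as in 2.13.0+) is ignored.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if ((x[i] + 0) > (y[i] + 0)) exit 0
            if ((x[i] + 0) < (y[i] + 0)) exit 1
        }
        exit 0
    }'
}

# Example use (commented out because it needs BLAST+ installed and, for the
# download, a lot of disk space; the parsing assumes the first line of
# "blastdbcmd -version" looks like "blastdbcmd: 2.13.0+"):
# installed=$(blastdbcmd -version | awk 'NR == 1 {print $2}')
# if version_ge "$installed" 2.8.1; then   # 2.8.1 is an assumed minimum
#     update_blastdb.pl --decompress nr    # fetch the preformatted nr database
# else
#     echo "BLAST+ $installed looks too old for -taxids; please update" >&2
# fi
```

update_blastdb.pl ships with BLAST+; the nr download is very large, so run it where you have the space.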