14 months ago
K.Gee ▴ 40

Hello, biostars,

I want to download the accession numbers of all bacterial proteins. From https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria, Send to --> File --> Format (Accession List) --> Create File does not seem to work for bacteria (I tested it with viruses and archaea, and it works perfectly). After that, I tried to extract the list of accession numbers from the command line, but I could not do so. Even the command that NCBI proposes for genomes (the command-line tool option at "https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria") does not seem to work. It gives:

datasets download genome taxon 2 --filename bacteria.zip


I got this error: unknown flag: --filename

I also tried changing the command, for example from genome to gene (datasets download gene taxon 2 --filename bacteria.zip), but that downloads the gene with ID 2 (it parses the term taxon). I also tried curl 'ftp://ftp.ncbi.nlm.nih.gov/protein/?term=bacteria%5BAll+Fields%5D'

Does anybody have an idea how to handle this issue?

number accession NCBI • 1.4k views

A related Python script that you could use (search by FASTA title): How to download all sequences of a list of proteins for a particular organism


Thanks for the response. I will use the script if I need to download the respective sequences. Again, thanks a lot for the script :D


AFAIK, datasets is only meant to work with genome-level data. Running

./datasets download genome taxon 2


will get you information about bacterial genome accessions. You can narrow the results with these options:

--reference         limit to reference and representative (GCF_ and GCA_) assemblies
--refseq            limit to RefSeq (GCF_) assemblies


Thanks again for the response. I knew that it works at the genome level, but I saw a gene option, so my idea was to download all the genes and afterwards extract the accession numbers... I know that my idea was a bit stupid and complicated :P

14 months ago
Sej Modha 5.1k

You could use NCBI's command-line E-utilities (EDirect) instead.

esearch -db protein -query 'txid2 [Orgn]'|efetch -format acc > txid2_protein_acc.txt
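
Because taxid 2 covers every bacterial protein record, the result set is likely to be very large, so it may be worth checking the record count before starting the download. The following is a minimal sketch, assuming EDirect (esearch, efetch, xtract) is installed and on PATH. The function names protein_count and protein_accessions are made up for illustration and are not part of EDirect.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of EDirect.

# Print the number of protein records assigned to a taxid (e.g. 2 = Bacteria).
protein_count() {
    local taxid="$1"
    esearch -db protein -query "txid${taxid}[Orgn]" |
        xtract -pattern ENTREZ_DIRECT -element Count
}

# Write one accession per line for every protein record under a taxid.
protein_accessions() {
    local taxid="$1" out="$2"
    esearch -db protein -query "txid${taxid}[Orgn]" |
        efetch -format acc > "$out"
}

# Example usage (needs network access, so it is left commented out):
# protein_count 2
# protein_accessions 2 txid2_protein_acc.txt
```

Wrapping the two pipelines in functions keeps the taxid as a parameter, so the same sketch could be tried on a smaller taxon first.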


A lot of records are going to be WP* accessions which point to multiple organisms. Something to keep in mind.
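
To see how much of the list is affected, the accession file can be split by prefix. This is a minimal sketch, assuming one accession per line as in txid2_protein_acc.txt. The function name split_wp and the output file names are made up for illustration.

```shell
# Sketch only: split_wp is a hypothetical helper.
# Usage: split_wp ACC_FILE WP_OUT OTHER_OUT
split_wp() {
    local acc_file="$1" wp_out="$2" other_out="$3"
    # WP_ accessions are non-redundant RefSeq proteins shared across organisms.
    grep '^WP_' "$acc_file" > "$wp_out" || true       # grep exits 1 when nothing matches
    grep -v '^WP_' "$acc_file" > "$other_out" || true
    # Report how many accessions landed in each file.
    printf 'WP_: %s\nother: %s\n' \
        "$(wc -l < "$wp_out" | tr -d ' ')" \
        "$(wc -l < "$other_out" | tr -d ' ')"
}

# Example: split_wp txid2_protein_acc.txt wp_acc.txt other_acc.txt
```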


Thanks a lot for the tip :-) !!!


Super thank you! Works exactly as I want!!!

14 months ago
GenoMax 117k

If you have access to the nr BLAST database, then use blastdbcmd, which is part of the BLAST+ package.

blastdbcmd -db nr -taxids 2 -outfmt %a


If your next question is going to be about creating a subset FASTA file of these sequences, then use

blastdbcmd -db nr -taxids 2 -outfmt %f > bacteria.fa


This command looks very, very interesting; however, I want to check that I understand your response correctly:

Did you mean locally? I'm asking because the command doesn't accept the term "taxids"


Correct. You will need to have the nr BLAST indexes downloaded locally along with the taxonomy files. Make sure you have the latest BLAST+ installed.
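
One way to check the local setup before running the blastdbcmd commands above is sketched here. It assumes blastdbcmd is on PATH and that update_blastdb.pl (shipped with BLAST+) is used to fetch the database. The function name blast_supports_taxids is made up for illustration, and the download lines are left commented out because nr is a very large download.

```shell
# Sketch only: blast_supports_taxids is a hypothetical helper.
# Returns 0 if the installed blastdbcmd lists a -taxids option in its help text.
blast_supports_taxids() {
    local help_text
    help_text="$(blastdbcmd -help 2>&1)" || true
    case "$help_text" in
        *-taxids*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example usage:
# if blast_supports_taxids; then
#     blastdbcmd -db nr -taxids 2 -outfmt %a > bacteria_acc.txt
# else
#     echo "blastdbcmd has no -taxids option; update BLAST+" >&2
# fi
#
# Fetching the database and taxonomy files locally (large download):
# update_blastdb.pl --decompress nr
# update_blastdb.pl --decompress taxdb
```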