Download all Bacteria accession list from NCBI
3.0 years ago
K.Gee ▴ 40

Hello, biostars,

I want to download the accession numbers of all bacterial proteins. On https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria, Send to --> File --> Format (Accession List) --> Create File does not seem to work for bacteria (I tested it with viruses and archaea, and there it works perfectly). After that, I tried to extract the accession list from the command line, but I could not. Even the command that NCBI proposes for genomes (the "command-line tool" option, "https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria") does not seem to work. It gives:

datasets download genome taxon 2 --filename bacteria.zip 

I got this error: unknown flag: --filename

I also tried changing the command, for example from genome to gene (datasets download gene taxon 2 --filename bacteria.zip), but that downloads the gene with ID 2 (parses the term taxon). I also tried curl 'ftp://ftp.ncbi.nlm.nih.gov/protein/?term=bacteria%5BAll+Fields%5D'

Does anybody have an idea how to solve this issue?

number accession NCBI • 2.5k views

A related Python script that you could use (search by FASTA title): How to download all sequences of a list of proteins for a particular organism


Thanks for the response. I will use the script if I need to download the respective sequences. Again, thanks a lot for the script :D


AFAIK datasets is only meant to work with genome-level data. Running

./datasets download genome taxon 2

will get you information about bacterial genome accessions. You can use the following options to restrict the results:

--reference         limit to reference and representative (GCF_ and GCA_) assemblies
--refseq            limit to RefSeq (GCF_) assemblies

Thanks again for the response. I knew that it works at the genome level, but I saw a gene option, so my idea was to download all the genes and then extract the accession numbers... I know that my idea was a bit stupid and complicated :P

3.0 years ago
Sej Modha 5.3k

You could use NCBI's command-line E-utilities (EDirect) instead.

esearch -db protein -query 'txid2 [Orgn]' | efetch -format acc > txid2_protein_acc.txt

A lot of the records are going to be WP_* accessions (non-redundant RefSeq proteins), each of which can belong to multiple organisms. Something to keep in mind.
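
If you want to handle those records separately, the accession list can be split after the download. This is only a minimal sketch: it assumes the list has one accession per line (as efetch -format acc writes it), and the function name split_wp and the output file suffixes are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split an accession list (one accession per line)
# into WP_ accessions and all other accessions.
#   $1 = input accession list
#   $2 = prefix for the two output files
split_wp() {
    local infile="$1"
    local prefix="$2"
    # Lines starting with WP_ (non-redundant RefSeq protein records).
    # grep exits with status 1 when nothing matches, hence "|| true".
    grep '^WP_' "$infile" > "${prefix}_wp.txt" || true
    # Every other accession.
    grep -v '^WP_' "$infile" > "${prefix}_other.txt" || true
}
```

For example, split_wp txid2_protein_acc.txt txid2 would write txid2_wp.txt and txid2_other.txt next to the original list.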


Thanks a lot for the tip :-) !!!


Super thank you! Works exactly as I want!!!

3.0 years ago
GenoMax 141k

If you have access to the nr BLAST database, use blastdbcmd, which is part of the BLAST+ package.

blastdbcmd -db nr -taxids 2 -outfmt %a

If your next question is going to be about creating a subset FASTA of these sequences, then use

blastdbcmd -db nr -taxids 2 -outfmt %f > bacteria.fa

This command looks very, very interesting. However, to check that I understand your response correctly:

"If you have access to nr"

Did you mean locally? I'm asking because the command doesn't accept the "taxids" option.


Correct. You will need to have the nr BLAST indexes downloaded locally, along with the taxonomy files. Make sure you have the latest BLAST+ installed.
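
One way to check whether an installed BLAST+ is recent enough is to compare its version number against a minimum. The sketch below rests on an assumption that should be verified against the BLAST+ release notes: that -taxids arrived with BLAST+ 2.8.1, the release that introduced version 5 databases. The function name blast_supports_taxids is hypothetical; it reads the output of blastdbcmd -version on stdin and needs a sort that supports -V (for example GNU sort).

```shell
#!/usr/bin/env bash
# ASSUMPTION: -taxids is available from BLAST+ 2.8.1 onwards.
# Adjust this value if the release notes say otherwise.
MIN_BLAST_VERSION="2.8.1"

# Hypothetical helper: reads `blastdbcmd -version` output on stdin and
# succeeds (exit status 0) if the reported version is >= MIN_BLAST_VERSION.
blast_supports_taxids() {
    local found lowest
    # Take the first x.y.z version number that appears in the input.
    found=$(grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
    [ -n "$found" ] || return 1
    # sort -V orders version strings; if the minimum sorts first (or the
    # two are equal), the installed version is new enough.
    lowest=$(printf '%s\n%s\n' "$MIN_BLAST_VERSION" "$found" | sort -V | head -n 1)
    [ "$lowest" = "$MIN_BLAST_VERSION" ]
}
```

It could be used as: blastdbcmd -version | blast_supports_taxids && echo "version looks recent enough for -taxids"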
