Question: Download All The Bacterial Genomes From NCBI
5.6 years ago by
rehma.ar220 wrote:

Dear all!

I want to download all the bacterial genomes from NCBI. When I check the number of available genomes at NCBI at this link, it shows the total number of bacterial genomes as 3791. But when I download them from the FTP site using this wget command, I get fewer than 2300 genomes.

Can anyone tell me why that is, and how I can download all of them?

ncbi • 32k views
ADD COMMENTlink modified 6 months ago by Biostar ♦♦ 20 • written 5.6 years ago by rehma.ar220

A lot of genomes don't have any data. Look at the Chr column in the table: if there is no number, then no sequence is available.

ADD REPLYlink modified 21 months ago • written 5.6 years ago by Asaf4.8k
2.4 years ago by
kristjan110 wrote:

NCBI has moved the complete bacterial genomes file on their FTP site to a location where it is no longer updated. Does anyone know the reason? And how can the most recent complete genomes be downloaded as a single FASTA file?

ADD COMMENTlink written 2.4 years ago by kristjan110

It's not possible to download the most recent complete bacterial genomes as one fasta file.

What you can do is:

1. Get the list of assemblies:

2. Parse the addresses of complete genomes from it (right now n = 4,804):
    awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt

3. Make a dir for data
    mkdir GbBac

4. Fetch data
    for next in $(cat assembly_summary_complete_genomes.txt); do wget -P GbBac "$next/*genomic.fna.gz"; done

5. Extract data
    gunzip GbBac/*.gz

6. Concatenate data
    cat GbBac/*.fna > all_complete_Gb_bac.fasta
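The fetch step above relies on wget expanding the "*" on the FTP server. As an alternative sketch, the file URL can be built explicitly. This assumes that each assembly directory holds a file named after the directory itself plus "_genomic.fna.gz", and that rows without a path carry "na" in column 20. The function name complete_genome_urls is made up for illustration.

```shell
# complete_genome_urls: read assembly_summary.txt rows on stdin and print
# one file URL per assembly whose level (column 12) is "Complete Genome".
complete_genome_urls() {
  awk -F '\t' '
    /^#/ { next }                                  # skip header/comment lines
    $12 == "Complete Genome" && $20 != "na" {      # column 20 = directory URL
      n = split($20, parts, "/")                   # last path element = directory name
      print $20 "/" parts[n] "_genomic.fna.gz"
    }'
}
```

With the function defined, something like complete_genome_urls < assembly_summary.txt > urls.txt followed by wget -P GbBac -i urls.txt would stand in for steps 2 and 4.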

edit. Where can I read about the recent changes to post formatting @ biostars?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by 5heikki7.5k

Slightly different reply from 5heikki above -- this includes all bacterial sequences, complete and incomplete.

Here is my recipe, adapted from Case 1 in this document:

-- at Bash/Mac OSX prompt in the desired directory:

curl '' | \
awk '{FS="\t"} \!/^#/ {print $20}' | \
sed -r 's|(|\1\2/\2_genomic.fna.gz|' > genomic_file

-- final command, in the same directory, where you want to install the files:

wget -i genomic_file

GenBank is where all current complete and incomplete sequences have been stored and updated since Dec 12, 2015. Note that if you want a different taxonomic branch, you have to look at the NCBI FTP site and replace "bacteria" in the FTP address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq". (This is also described in the factsheet.) The factsheet says: "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command's 'refseq' to 'genbank'. Since these assembly's accession initial are different, you will need change sed command's 'GCF' to 'GCA'."

If copying and pasting the lines above, or from the factsheet, gives you an error, I would paste them into a text editor with syntax highlighting (Emacs, textedit, etc.) and re-type any weirdly colored characters, and quotes on principle (right-slanted double quotes are interpreted differently than left-slanted, etc.).

Certain Bash prompts may require an escape character ("\") for special characters used within AWK commands. If you still have errors, as a second round of treatment, try removing the escape \ in front of the !.
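One way to sidestep the "!" escaping question altogether, as a sketch: skip comment lines with "next" instead of negating the pattern, so the awk program contains no "!" at all. The function name ftp_paths is hypothetical; column 20 is the one used in the command above.

```shell
# ftp_paths: print column 20 of every non-comment, tab-separated row on stdin.
ftp_paths() {
  awk -F '\t' '/^#/ { next } { print $20 }'
}
```

Setting the separator with -F before any input is read also avoids relying on FS being assigned inside the main block.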

Finally, the folder content is not a final database for standalone BLAST. (I used blastn, part of the BLAST+ suite of tools provided by NCBI.) You will have to use makeblastdb (included if you downloaded the suite to get blastn) to reformat and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), a complete taxonomic download contains too many files for makeblastdb to handle. However, if you cat them into one file, it's fine!

cat *.fna > all_bacteria_fna_files.fna

makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria

Then, you have to make sure blastn can find the folder containing the new database by setting it in the BLASTDB environment variable.

export BLASTDB="$HOME/genomes/bacteria/genbank_2_3_2016"

Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!
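For downloads that are still gzip-compressed, the concatenation step can be sketched as below. With many thousands of files, a plain "cat *.fna" may also fail with an "Argument list too long" error, which looping over find output avoids. The function name combine_fasta and its arguments are made up for illustration.

```shell
# combine_fasta DIR OUT: decompress every *.fna.gz under DIR into one FASTA file OUT.
combine_fasta() {
  dir=$1
  out=$2
  : > "$out"                                   # start from an empty output file
  find "$dir" -type f -name '*.fna.gz' -print | sort |
    while IFS= read -r f; do
      gzip -dc "$f" >> "$out"                  # append the decompressed sequences
    done
}
```

The combined file can then be passed to makeblastdb with -in, as in the command above.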

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by rattus840

I got an error message with your command:

  0 24.0M    0 15928    0     0   9102      0  0:46:15  0:00:01  0:46:14  9101curl: (23) Failed writing body (0 != 2896)
ADD REPLYlink written 20 months ago by Picasa350

Looks like you don't have write permission in the directory in which you're executing the curl command.

ADD REPLYlink modified 20 months ago • written 20 months ago by 5heikki7.5k

For those from the future: have a look at "ctseto"'s addition further down regarding updates to the .fna links, and avoid spending hours of your precious time trying to fix it (like me =)). The above sed command is now out of date because NCBI changed the link addresses.

ADD REPLYlink written 6 months ago by Leonardo20

I had to experiment with rattus8's response above because I am working on a MacBook Pro and the extended set of regular expressions requires download of gnu sed (I used homebrew to brew install gsed) and there are some syntax differences.... (NB- I also installed gnu awk using brew install gawk). Here is the command for establishing a proper file for wget download of bacterial refseq as of today:

sudo curl '' | gawk 'BEGIN{FS="\t";} /^#/ {next} {print $20}' | gsed -r 's|(|\1\2\/\2_genomic.fna.gz|' > refseq_file

after which, all you need to do is:

wget -i refseq_file

as rattus8 described above...


ADD REPLYlink modified 14 days ago • written 14 days ago by bgold040

Hi, I used the method described above to download all bacterial genomes present in RefSeq as *.fna.gz files, but I also want to get the *.gff.gz files for RNA-seq analysis. Please help me in this regard.

ADD REPLYlink written 9 days ago by mirzaabid0

This was very helpful. Thanks 5heikki.

ADD REPLYlink written 9 months ago by gaurav.amit3060
18 months ago by
Hajk-Georg Drost130 wrote:

I know that this question is already 4 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve all bacterial reference genomes from several database sources one can simply type:

# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")


# download all bacterial reference genomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

Alternatively, you can also specify: type = "proteome", type = "CDS" (coding sequence) or type = "gff".

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Escherichia_coli_genomic_refseq.fna.gz

Organism Name: Escherichia_coli

Database: NCBI refseq


Download_Date: Wed Feb 15 15:17:50 2017

refseq_category: reference genome

assembly_accession: GCF_000005845.2

bioproject: PRJNA57779

biosample: SAMN02604091

taxid: 511145

infraspecific_name: strain=K-12 substr. MG1655

version_status: latest

release_type: Major

genome_rep: Full

seq_rel_date: 2013-09-26

submitter: Univ. Wisconsin

I hope this helps.

ADD COMMENTlink written 18 months ago by Hajk-Georg Drost130

Hi, so what if I want to specifically download all genomes available for a bacterial family (Pasteurellaceae)? Regards, Ahmed

ADD REPLYlink written 18 months ago by ahmedmagds0

Many thanks for pointing out to me that this functionality might be useful. I sat down and extended the functionality of the meta.retrieval() function which now allows you to specify the "group" argument in addition to the "kingdom" argument. This way, you can download subgroups of species. Unfortunately, NCBI does not provide the family information in their assembly report files that I parse to automatically retrieve the download paths for particular species (only kingdom, group, and subgroup information are available). However, if I am not mistaken, then Pasteurellaceae are members of the class "Gammaproteobacteria". Thus, with biomartr you could now retrieve all bacterial genomes, proteomes, CDS, and gff files that belong to the class "Gammaproteobacteria" as follows:

# retrieve all genomes belonging to Gammaproteobacteria from NCBI RefSeq
meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "genome")

Please note that for this new functionality of biomartr you need to install the developer version from GitHub (e.g. via the devtools package). In the next CRAN submission, this new functionality will be available.

I also updated the Meta-Genome Retrieval vignette and added some examples of how to retrieve genomes from subgroups of kingdoms. So you might also consult this vignette for more details. I hope this helps you and I am always happy to receive feedback on potential extensions or new features that I could implement into biomartr.

ADD REPLYlink written 18 months ago by Hajk-Georg Drost130
5.6 years ago by
Rahul Sharma550
Germany and India
Rahul Sharma550 wrote:


How many sequences are you getting with this wget command? At the link mentioned, only 2379 of the bacterial species have genomic DNA. Click on "Download selected records" and run awk -F"\t" '$5>0' genomes_overview.txt | wc -l.
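A slightly stricter version of that count, as a sketch: it only counts rows whose chromosome field is a plain number greater than zero, so a text header or a "-" placeholder is not counted by accident. It assumes the downloaded table is tab-separated with the chromosome count in column 5, as in the command above; the function name count_with_sequence is hypothetical.

```shell
# count_with_sequence FILE [COL]: count rows whose COL-th tab-separated field
# (default: 5) is an integer greater than zero.
count_with_sequence() {
  awk -F '\t' -v col="${2:-5}" '
    $col ~ /^[0-9]+$/ && $col > 0 { n++ }
    END { print n + 0 }' "$1"
}
```

Calling count_with_sequence genomes_overview.txt prints the number of entries that have at least one chromosome listed.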

Best wishes, Rahul

ADD COMMENTlink written 5.6 years ago by Rahul Sharma550

Thanks for responding. Yes, that's right, it gives 2379, but I can only download 2258 with the above-mentioned command.

ADD REPLYlink written 5.6 years ago by rehma.ar220
5.6 years ago by
Josh Herr5.6k
University of Nebraska
Josh Herr5.6k wrote:

Just adding to what is already here: You are probably able to download "all" of the bacterial genome data that has been released by NCBI.

While NCBI may list 3791 bacterial genomes, these genomes are in various states of completion (most genomes actually remain "drafts" for many years, if they are ever designated as non-draft). It's my understanding that NCBI-listed bacterial genome projects may be recorded during any stage of production (with intent to sequence, sequencing in progress, or in a stage of assembly, annotation, etc.), and you may not be able to download "all" of the "available" genomes in a draft state. Try searching NCBI or elsewhere for contigs of genomes that are not yet fully released. The number of available genomes can change from day to day as NCBI updates genome drafts, updates servers, or moves data from one server to another, so the number is in a constant state of flux: if you wget from the FTP site, what you download may differ from day to day.

I've found that the GOLD database is a good place to check on the status of a specific genome sequencing project.

ADD COMMENTlink written 5.6 years ago by Josh Herr5.6k
3.8 years ago by
UK, Hinxton, EMBL-EBI
Denise - Open Targets4.6k wrote:

Ensembl Bacteria has >15,000 bacterial genomes annotated in the INSDC assembly database as complete. There are gene models too. In the next release of Ensembl Genomes, the number will go up to 20,000.

ADD COMMENTlink written 3.8 years ago by Denise - Open Targets4.6k
22 months ago by
ctseto20 wrote:

Slight change to the syntax required for those pulling from bacteria.

From example output, the directory structure :

An example of column 20 from the bacteria assembly summary:

To account for the change from /all/GCF...../GCF...._genomic.fna.gz to GCA/[...]/[...]/[...]/

The new proposed version of the one-liner to construct the URLs for the genomic.fna files is:

curl '' | awk '{FS="\t"} !/^#/ {print $20}' | sed -r 's|([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' > genomic_file

Having never really used sed like this before, some headscratching took place before I got things working.
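Since the one-liner had to change when the directory layout changed, a layout-independent variant may be worth sketching. It only looks at the last path element, assuming the file is named after its directory plus a suffix and that missing paths appear as "na". It uses sed -E, which both GNU sed and the BSD sed on macOS accept. The function name genomic_urls and the optional suffix argument are hypothetical.

```shell
# genomic_urls [SUFFIX]: turn assembly directory URLs on stdin into file URLs.
# SUFFIX defaults to genomic.fna.gz; genomic.gff.gz would select annotation files.
genomic_urls() {
  suffix=${1:-genomic.fna.gz}
  # Drop "na" placeholders, then append "<dirname>_<suffix>" to each directory URL.
  sed -E -e '/^na$/d' -e "s|/([^/]+)/?\$|/\1/\1_${suffix}|"
}
```

Piping column 20 of the assembly summary through genomic_urls > genomic_file gives a list that wget -i genomic_file can consume, however many directory levels precede the assembly name.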

ADD COMMENTlink modified 22 months ago • written 22 months ago by ctseto20

Oh my!!! So many thanks! Spent hours trying to figure it out.

ADD REPLYlink written 6 months ago by Leonardo20
3.8 years ago by
Darko.K0 wrote:

Hi everybody,

I'm looking to download all complete bacterial genomes. There's an option with to show only complete prokaryotic genomes (3243), and I'm interested in downloading just these. provides the possibility to download everything, but that's not what I'm looking for. Does someone know a possibility for that? Thank you all!

ADD COMMENTlink written 3.8 years ago by Darko.K0
3.8 years ago by
Darko.K0 wrote:

How complete are these >15,000 genomes? And is there a way to download all genomes in FASTA (DNA) format with one click?

ADD COMMENTlink written 3.8 years ago by Darko.K0

They are annotated in the INSDC (e.g. ENA, the European Nucleotide Archive) as containing the full genome representation, with CDS annotations for example. You may want to contact ENA for further details on completeness. To download all in one go try wget on

ADD REPLYlink written 3.8 years ago by Denise - Open Targets4.6k