Question

Download All The Bacterial Genomes From NCBI

15

11.3 years ago

rehma.ar ▴ 290

Dear all,

I want to download all the bacterial genomes from NCBI. The genome browser at http://www.ncbi.nlm.nih.gov/genome/browse/ lists 3791 bacterial genomes, but when I download them from the FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ with the command wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz, I get fewer than 2300 genomes.

Can anyone tell me why that is, and how I can download all of them?

ncbi • 57k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 11.3 years ago by rehma.ar ▴ 290

4

A lot of genomes don't have any data. Look at the Chr column in the table: if there is no number, no sequence is available.

ADD REPLY • link 7.5 years ago by Asaf 10k

Ram · Answer 1 · 2016-03-10

14

8.1 years ago

kristjan ▴ 170

NCBI has moved the complete bacterial genomes file on its FTP site to ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/, where it is no longer updated. Does anyone know the reason? And how can I download the most recent complete genomes as a single FASTA file?

ADD COMMENT • link 8.1 years ago by kristjan ▴ 170

8

It's not possible to download the most recent complete bacterial genomes as one fasta file.

What you can do is:

1. Get the list of assemblies:

```
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
```

2. Parse the addresses of complete genomes from it (right now n = 4,804):

```
awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt
```

3. Make a directory for the data:

```
mkdir GbBac
```

4. Fetch the data:

```
for next in $(cat assembly_summary_complete_genomes.txt); do wget -P GbBac "$next"/*genomic.fna.gz; done
```

5. Extract the data:

```
gunzip GbBac/*.gz
```

6. Concatenate the data:

```
cat GbBac/*.fna > all_complete_Gb_bac.fasta
```
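To see how the count of complete genomes relates to everything else in the summary file, the assembly levels can be tallied first. This is only a rough sketch: it assumes a local copy of assembly_summary.txt that is tab-separated, has comment lines starting with #, and keeps the assembly level in column 12 (the same column the awk filter above uses). The function name count_assembly_levels is made up for illustration.

```shell
# Tally assemblies per assembly level (column 12 of assembly_summary.txt).
# Assumptions: tab-separated input, comment/header lines start with '#'.
count_assembly_levels() {
    awk -F '\t' '
        /^#/ { next }                      # skip header/comment lines
        { count[$12]++ }                   # column 12 = assembly level
        END { for (level in count) printf "%d\t%s\n", count[level], level }
    ' "$1" | sort -k1,1nr                  # largest group first
}

# Usage (hypothetical local file):
#   count_assembly_levels assembly_summary.txt
```

The "Complete Genome" row of the output should match the number of lines written to assembly_summary_complete_genomes.txt.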

edit. Where can I read about the recent changes to post formatting @ biostars?

ADD REPLY • link updated 5.4 years ago by Ram 43k • written 8.1 years ago by 5heikki 11k

3

Slightly different reply from 5heikki above -- this includes all bacterial sequences, complete and incomplete.

Here is my recipe, adapted from Case 1 in this document: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

At a Bash/Mac OS X prompt, in the desired directory:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | awk -F '\t' '!/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genomic_file

Final command, in the directory where you want the files to end up:

wget -i genomic_file

GenBank is where all current complete and incomplete sequences have been stored and updated since Dec 12, 2015. Note that if you want a different taxonomic branch, you have to look at the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) and replace "bacteria" in the FTP address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq" and, in the sed command, "GCA" with "GCF". This is the reverse of what the factsheet describes: "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command's "refseq" to "genbank". Since these assembly's accession initial are different, you will need change sed command's "GCF" to "GCA"."

If copying and pasting the lines above, or from the factsheet, gives you an error, I would paste them into a text editor with syntax highlighting (Emacs, TextEdit, etc.) and re-type any weirdly colored characters, plus the quotes and hyphens on principle (curly quotes and typographic hyphens are not interpreted the way the plain ASCII ones are).

Some shells (csh/tcsh, for example) require an escape character ("\") before the ! used within AWK commands, which is why the factsheet writes \!. In Bash, inside single quotes, the escape must be left out, as in the command above. If you copied the factsheet version and awk complains, remove the \ in front of the !.

Finally, the downloaded folder is not yet a database for standalone BLAST. (I used blastn, part of the BLAST+ suite of tools provided by NCBI.) You will have to use makeblastdb (included in the same suite) to convert and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), a complete taxonomic download contains too many files for makeblastdb to handle. However, if you decompress them (gunzip *.gz) and cat them into one file, it's fine!

cat *.fna > all_bacteria_fna_files.fna

makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria

Then, you have to make sure blastn can find the folder containing the new database through the BLASTDB environment variable.

export BLASTDB="$HOME/genomes/bacteria/genbank_2_3_2016"

Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!
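With tens of thousands of downloaded files, a plain cat *.fna may itself fail with "Argument list too long", and decompressing everything first doubles the disk footprint. A possible workaround, offered only as a sketch: stream the compressed files straight into one FASTA with find and xargs. The function name concat_fasta and the *_genomic.fna.gz naming are assumptions, and the sketch expects at least one matching file.

```shell
# Concatenate all compressed genome FASTA files under a directory into a
# single uncompressed file, without expanding a huge glob on the command line.
# Assumptions: files are named *_genomic.fna.gz and at least one exists.
concat_fasta() {
    dir="$1"   # directory holding the downloaded .fna.gz files
    out="$2"   # path of the combined FASTA to write
    find "$dir" -type f -name '*_genomic.fna.gz' -print0 \
        | xargs -0 gzip -dc > "$out"
}

# Usage (hypothetical paths):
#   concat_fasta . all_bacteria_fna_files.fna
```

The resulting file can then be passed to makeblastdb exactly as in the command above.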

ADD REPLY • link 8.1 years ago by rattus8 ▴ 40

0

I got an error message with your command:

  0 24.0M    0 15928    0     0   9102      0  0:46:15  0:00:01  0:46:14  9101curl: (23) Failed writing body (0 != 2896)

ADD REPLY • link 7.4 years ago by Picasa ▴ 640

0

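Looks like you don't have write permission in the directory in which you're executing the curl command. Since curl's output is piped here, error 23 can also mean that a later command in the pipeline (awk or sed) exited early, so check those for errors too.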

ADD REPLY • link 7.4 years ago by 5heikki 11k

0

For those from the future: have a look at cseto's answer further down regarding the updated .fna links, and avoid spending hours of your precious time trying to fix it (like me =)). The sed command above is now out of date because NCBI changed the link addresses.

ADD REPLY • link 6.2 years ago by Leonardo ▴ 30

0

I had to experiment with rattus8's response above because I am working on a MacBook Pro, where the extended regular expression syntax requires GNU sed (I used Homebrew: brew install gsed), and there are some syntax differences. (NB: I also installed GNU awk using brew install gawk.) Here is the command for building a proper URL file for a wget download of bacterial RefSeq as of today (no sudo needed):

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt' | gawk 'BEGIN{FS="\t";} /^#/ {next} {print $20}' | gsed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCF_.+)|\1\2\/\2_genomic.fna.gz|' > refseq_file

after which, all you need to do is:

wget -i refseq_file

as rattus8 described above...

Enjoy!

ADD REPLY • link 5.7 years ago by bgold04 • 0

0

Hi, I used the method described above to download all bacterial genomes in RefSeq as *.fna.gz files, but I also want the *.gff.gz files for RNA-seq analysis. Please help me with this.
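One possible way to get the annotation files, as a sketch: reuse the URL list built for the FASTA download (refseq_file in the command above) and swap the file suffix. This assumes each assembly directory carries a *_genomic.gff.gz file next to *_genomic.fna.gz, which may not hold for unannotated assemblies. The function name fna_to_gff_urls is made up for illustration.

```shell
# Turn a list of *_genomic.fna.gz URLs (one per line) into the matching
# *_genomic.gff.gz URLs. Assumption: the GFF file sits in the same
# directory and shares the same name prefix as the FASTA file.
fna_to_gff_urls() {
    sed 's/_genomic\.fna\.gz$/_genomic.gff.gz/' "$1"
}

# Usage (hypothetical file names):
#   fna_to_gff_urls refseq_file > refseq_gff_file
#   wget -i refseq_gff_file
```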

ADD REPLY • link 5.7 years ago by mirzaabid • 0

1

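4) can be made faster with xargs and 8 parallel jobs like this (-I already runs one wget per line, so -n1 is not needed, and the two options conflict in recent GNU xargs):

cat assembly_summary_complete_genomes.txt | xargs -P8 -I{} wget -P GbBac '{}/*_genomic.fna.gz'

Thanks @kristian for the very clear tutorial!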

ADD REPLY • link 5.5 years ago by Stephane Plaisance ▴ 460

0

This was very helpful. Thanks 5heikki.

ADD REPLY • link 6.5 years ago by gaurav.amit30 ▴ 80

0

It looks like it is still possible to download the most recent complete bacterial genomes as a relatively small number of FASTA files from here:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria.*.genomic.fna.gz

These should be very straightforward to concatenate into one big FASTA file using zcat.

It is not clear to me, though, what the difference is between the bacterial genomes at the above link and those at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/

ADD REPLY • link 5.2 years ago by bstrs • 0

score 7 · Answer 2 · 2017-02-15

I know that this question is already 4 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve all bacterial reference genomes from several database sources one can simply type:

# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

or

# download all bacterial reference genomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

Alternatively, you can also specify: type = "proteome", type = "CDS" (coding sequence) or type = "gff".

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Escherichia_coli_genomic_refseq.fna.gz

Organism Name: Escherichia_coli

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Download_Date: Wed Feb 15 15:17:50 2017

refseq_category: reference genome

assembly_accession: GCF_000005845.2

bioproject: PRJNA57779

biosample: SAMN02604091

taxid: 511145

infraspecific_name: strain=K-12 substr. MG1655

version_status: latest

release_type: Major

genome_rep: Full

seq_rel_date: 2013-09-26

submitter: Univ. Wisconsin

I hope this helps.

score 3 · Answer 3 · 2013-01-17

3

11.3 years ago

Rahul Sharma ▴ 660

Hi,

How many sequences are you getting with this wget command? On the page you mention, only 2379 bacterial species have genomic DNA. Click "Download selected records" and run awk -F"\t" '$5>0' genomes_overview.txt | wc -l to count them.

Best wishes, Rahul

ADD COMMENT • link 11.3 years ago by Rahul Sharma ▴ 660

1

Thanks for responding. Yes, that's right, it gives 2379, but I can only download 2258 with the above-mentioned command.

ADD REPLY • link 11.3 years ago by rehma.ar ▴ 290

score 3 · Answer 4 · 2013-01-17

Just adding to what is already here: You are probably able to download "all" of the bacterial genome data that has been released by NCBI.

While NCBI may list 3791 bacterial genomes, these genomes are in various states of completion (most genomes actually remain "drafts" for many years, if they are ever designated as finished). It's my understanding that NCBI-listed bacterial genome projects may be recorded at any stage of production (intent to sequence, sequencing in progress, assembly, annotation, etc.), so you may not be able to download "all" of the "available" genomes that are in a draft state. Try searching NCBI or elsewhere for contigs of genomes that are not yet fully released. The number of available genomes can also change from day to day as NCBI updates genome drafts, updates servers, or moves data from one server to another, so it is in a constant state of flux: the file you wget from the FTP site may differ from one day to the next.

I've found that the GOLD database is a good place to check on the status of a specific genome sequencing project.

Ram · Answer 5 · 2014-10-31

2

9.5 years ago

Denise CS ★ 5.2k

Ensembl Bacteria has >15,000 bacterial genomes that are annotated in the INSDC assembly database as complete. There are gene models too. In the next release of Ensembl Genomes, the number will go up to 20,000.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Denise CS ★ 5.2k

score 2 · Answer 6 · 2016-10-21

A slight change to the syntax is required for those pulling from bacteria.

The earlier one-liner assumes this directory structure in its output:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000149845.2_SJ5/GCF_000149845.2_SJ5_genomic.fna.gz

An example of column 20 from the bacteria assembly summary:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/183/245/GCA_000183245.1_ASM18324v1/GCA_000183245.1_ASM18324v1_genomic.fna.gz

To account for the change from /all/GCF_.../GCF_..._genomic.fna.gz to /all/GCA/[...]/[...]/[...]/GCA_.../GCA_..._genomic.fna.gz, the new proposed version of the one-liner to construct the URLs for the genomic.fna files is:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | awk -F '\t' '!/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' > genomic_file

Having never really used sed like this before, some headscratching took place before I got things working.
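Because the sed pattern has to mirror the exact directory layout, it breaks whenever that layout changes. A layout-independent variant, sketched under assumptions: take the last path component of column 20 and append it with the _genomic.fna.gz suffix. This assumes column 20 holds the assembly directory URL, that rows without one are marked "na", and that the file is named after its directory. The function name build_fna_urls is made up for illustration.

```shell
# Build <dir>/<dirname>_genomic.fna.gz URLs from assembly_summary.txt
# without hard-coding the directory depth.
# Assumptions: column 20 = assembly directory URL ("na" when absent),
# and the FASTA file is named after the last path component.
build_fna_urls() {
    awk -F '\t' '
        /^#/ || $20 == "" || $20 == "na" { next }   # skip comments and rows with no path
        {
            n = split($20, parts, "/")              # parts[n] = assembly directory name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }
    ' "$1"
}

# Usage (hypothetical local file):
#   build_fna_urls assembly_summary.txt > genomic_file
#   wget -i genomic_file
```

The same function should work for either GCA or GCF paths, since it never looks at the accession prefix.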

score 0 · Answer 7 · 2014-10-30

Hi everybody,

I'm looking to download all complete bacterial genomes. There's an option at http://www.ncbi.nlm.nih.gov/genome/browse/ to show only complete prokaryotic genomes (3243), and I'm interested in downloading just these. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ makes it possible to download everything, but that's not what I'm looking for. Does anyone know a way to do that? Thank you all!

score 0 · Answer 8 · 2014-11-03

0

9.5 years ago

Darko.K • 0

How complete are these >15,000 genomes? And is there a way to download all genomes in FASTA (DNA) format with one click?

ADD COMMENT • link 9.5 years ago by Darko.K • 0

0

They are annotated in the INSDC (e.g. ENA, the European Nucleotide Archive) as containing the full genome representation, with CDS annotations for example. You may want to contact ENA for further details on completeness. To download everything in one go, try wget on ftp://ftp.ensemblgenomes.org/pub/current/bacteria/fasta.

ADD REPLY • link 9.5 years ago by Denise CS ★ 5.2k