Question: Download All The Bacterial Genomes From NCBI
8
5.3 years ago by
rehma.ar200
rehma.ar200 wrote:

Dear all!

I want to download all the bacterial genomes from NCBI. When I check the number of available genomes at http://www.ncbi.nlm.nih.gov/genome/browse/ it shows 3791 bacterial genomes in total. But when I download them from the FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ using the command wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz I get fewer than 2300 genomes.

Can anyone tell me why that is, and how I can download all of them?

ncbi • 30k views
ADD COMMENTlink modified 3 months ago by Biostar ♦♦ 20 • written 5.3 years ago by rehma.ar200
4

A lot of genomes don't have any data. Look at the Chr column in the table: if there is no number, then no sequence is available.

ADD REPLYlink modified 18 months ago • written 5.3 years ago by Asaf4.8k
7
2.2 years ago by
kristjan100
Estonia
kristjan100 wrote:

NCBI has moved the complete bacterial genomes files on its FTP site to ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/ where they are no longer updated. Does anyone know the reason? And how can the most recent complete genomes be downloaded as a single FASTA file?

ADD COMMENTlink written 2.2 years ago by kristjan100
3

It's not possible to download the most recent complete bacterial genomes as one fasta file.

What you can do is:

1. Get the list of assemblies:
    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

2. Parse the addresses of complete genomes from it (right now n = 4,804):
    awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt

3. Make a dir for data
    mkdir GbBac

4. Fetch data
    for next in $(cat assembly_summary_complete_genomes.txt); do wget -P GbBac "$next"/*genomic.fna.gz; done

5. Extract data
    gunzip GbBac/*.gz

6. Concatenate data
    cat GbBac/*.fna > all_complete_Gb_bac.fasta
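
A variation on the steps above, as a minimal sketch rather than a tested pipeline: instead of letting wget expand *genomic.fna.gz, build one exact URL per assembly. This assumes that column 12 is the assembly level and column 20 the FTP directory (as in step 2), that the file inside each directory is named after the directory, and that rows without a path carry the placeholder "na". The function name complete_genome_urls is made up for illustration. One reason to do this is that the assembly directories can also hold files such as *_cds_from_genomic.fna.gz, which the wildcard would pick up as well.

```shell
#!/usr/bin/env bash
# Sketch: print one *_genomic.fna.gz URL per "Complete Genome" row.
# Assumptions: tab-separated input, column 12 = assembly level,
# column 20 = FTP directory, "na" = no directory available.
complete_genome_urls() {
    awk -F '\t' '
        /^#/ { next }                                # skip comment/header lines
        $12 == "Complete Genome" && $20 != "na" {
            n = split($20, parts, "/")               # parts[n] is the directory name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }' "$1"
}

# Usage (needs network access, so left commented out):
#   complete_genome_urls assembly_summary.txt > complete_genome_urls.txt
#   wget -P GbBac -i complete_genome_urls.txt
```

Steps 5 and 6 (gunzip and cat) would then apply unchanged to the contents of GbBac.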

edit. Where can I read about the recent changes to post formatting @ biostars?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by 5heikki7.2k
3

Slightly different reply from 5heikki above -- this includes all bacterial sequences, complete and incomplete.

Here is my recipe, adapted from Case 1 in this document: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

-- at Bash/Mac OSX prompt in the desired directory:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | \
awk '{FS="\t"} \!/^#/ {print $20}' | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genomic_file

-- final command, in the same directory, where you want to install the files:

wget -i genomic_file

GenBank is where all current complete and incomplete sequences are being stored and updated since Dec 12, 2015. Note that if you want a different taxonomic branch, you have to look at the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) and replace "bacteria" in the FTP address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq". (This is also described in the factsheet.) "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command's "refseq" to "genbank". Since these assembly's accession initial are different, you will need change sed command's "GCF" to "GCA"."

If copying and pasting the lines above, or from the factsheet, gives you an error, I would paste them into a text editor with syntax highlighting (Emacs, textedit, etc.) and re-type any weirdly colored characters, and quotes on principle (right-slanted double quotes are interpreted differently than left-slanted, etc.).

Certain Bash prompts may require an escape character ("\") for special characters used within AWK commands. If you still have errors, as a second round of treatment, try removing the escape \ in front of the !.

Finally, the folder content is not a ready-made database for standalone BLAST. (I used blastn, part of the BLAST+ suite of command-line tools provided by NCBI.) You will have to use makeblastdb (included if you downloaded the suite to get blastn) to reformat and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), a complete taxonomic download like this contains too many files for makeblastdb to handle. However, if you decompress them (gunzip *.gz) and cat them into one file, it's fine!

cat *.fna > all_bacteria_fna_files.fna

makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria

Then, you have to make sure blastn can find the new database by setting the BLASTDB environment variable to the folder that contains it.

export BLASTDB="$HOME/genomes/bacteria/genbank_2_3_2016"

Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!
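
With tens of thousands of files, cat *.fna can itself fail because the expanded argument list gets too long for the shell. A hedged sketch of a workaround that streams the file names through find and xargs instead. The name concat_fasta is made up here, and the output should be written outside the input folder (or with a different extension) so that it is not read back as input.

```shell
#!/usr/bin/env bash
# Sketch: concatenate every .fna file in a folder into one FASTA file
# without putting all file names on a single command line.
# Usage: concat_fasta <input_dir> <output_file>
concat_fasta() {
    local dir=$1 out=$2
    # -print0 / -0 keep file names with spaces intact; sort makes the order reproducible
    find "$dir" -maxdepth 1 -type f -name '*.fna' -print0 \
        | sort -z \
        | xargs -0 cat > "$out"
}

# Example (the result can then be passed to makeblastdb via -in):
#   concat_fasta . ../all_bacteria.fasta
```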

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by rattus840

I got an error message with your command:

  0 24.0M    0 15928    0     0   9102      0  0:46:15  0:00:01  0:46:14  9101curl: (23) Failed writing body (0 != 2896)
ADD REPLYlink written 17 months ago by Picasa350

Looks like you don't have write permission in the directory in which you're executing the curl command..

ADD REPLYlink modified 17 months ago • written 17 months ago by 5heikki7.2k

For those from the future: have a look at ctseto's answer further down regarding the updated .fna links, and avoid spending hours of your precious time trying to fix it (like me =)). The above sed command is now out of date because NCBI changed the link addresses.

ADD REPLYlink written 3 months ago by Leonardo20

This was very helpful. Thanks 5heikki.

ADD REPLYlink written 6 months ago by gaurav.amit3060
6
15 months ago by
Cambridge
Hajk-Georg Drost120 wrote:

I know that this question is already 4 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve all bacterial reference genomes from several database sources one can simply type:

# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

or

# download all bacterial reference genomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

Alternatively, you can also specify: type = "proteome", type = "CDS" (coding sequence) or type = "gff".

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Escherichia_coli_genomic_refseq.fna.gz

Organism Name: Escherichia_coli

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Download_Date: Wed Feb 15 15:17:50 2017

refseq_category: reference genome

assembly_accession: GCF_000005845.2

bioproject: PRJNA57779

biosample: SAMN02604091

taxid: 511145

infraspecific_name: strain=K-12 substr. MG1655

version_status: latest

release_type: Major

genome_rep: Full

seq_rel_date: 2013-09-26

submitter: Univ. Wisconsin

I hope this helps.

ADD COMMENTlink written 15 months ago by Hajk-Georg Drost120

Hi, So what if I want to specifically download all genomes available for a bacterial family (Pasteurellaceae)? Regards Ahmed

ADD REPLYlink written 15 months ago by ahmedmagds0

Many thanks for pointing out to me that this functionality might be useful. I sat down and extended the functionality of the meta.retrieval() function which now allows you to specify the "group" argument in addition to the "kingdom" argument. This way, you can download subgroups of species. Unfortunately, NCBI does not provide the family information in their assembly report files that I parse to automatically retrieve the download paths for particular species (only kingdom, group, and subgroup information are available). However, if I am not mistaken, then Pasteurellaceae are members of the class "Gammaproteobacteria". Thus, with biomartr you could now retrieve all bacterial genomes, proteomes, CDS, and gff files that belong to the class "Gammaproteobacteria" as follows:

# retrieve all genomes belonging to Gammaproteobacteria from NCBI RefSeq
meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "genome")

Please note that for this new functionality of biomartr you need to install the developer version from GitHub (e.g. via the devtools package). In the next CRAN submission, this new functionality will be available.

I also updated the Meta-Genome Retrieval vignette and added some examples of how to retrieve genomes from subgroups of kingdoms. So you might also consult this vignette for more details. I hope this helps you and I am always happy to receive feedback on potential extensions or new features that I could implement into biomartr.

ADD REPLYlink written 15 months ago by Hajk-Georg Drost120
3
5.3 years ago by
Rahul Sharma540
Germany and India
Rahul Sharma540 wrote:

Hi,

How many sequences are you getting with this wget command? At the link you mentioned, only 2379 bacterial species have genomic DNA. Click on "Download selected records" and run awk -F"\t" '$5>0' genomes_overview.txt | wc -l to count them.

Best wishes, Rahul

ADD COMMENTlink written 5.3 years ago by Rahul Sharma540
1

Thanks for responding. Yes, that's right, it gives 2379, but I can only download 2258 with the above-mentioned command.

ADD REPLYlink written 5.3 years ago by rehma.ar200
3
5.3 years ago by
Josh Herr5.5k
University of Nebraska
Josh Herr5.5k wrote:

Just adding to what is already here: You are probably able to download "all" of the bacterial genome data that has been released by NCBI.

While NCBI may list 3791 bacterial genomes, these genomes are in various states of completion (most genomes actually remain "drafts" for many years, if they are ever designated as finished at all). It's my understanding that NCBI-listed bacterial genome projects may be recorded at any stage of production (intent to sequence, sequencing in progress, assembly, annotation, etc.), so you may not be able to download "all" of the "available" genomes that are in a draft state. Try searching NCBI or elsewhere for contigs of genomes that are not yet fully released. The number of available genomes can change from day to day as NCBI updates genome drafts, updates servers, or moves data from one server to another, so it is in a constant state of flux: if you wget from the FTP site, the file you download may differ from day to day.
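
To see how the listed assemblies split across completion states, one option is to tally the assembly-level column of NCBI's bacterial assembly summary (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt). A small sketch, assuming the file is tab-separated with the assembly level in column 12; count_by_level is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count assemblies per assembly level (assumed to be column 12).
# Prints "<level><TAB><count>", sorted by level name.
count_by_level() {
    awk -F '\t' '
        /^#/ { next }                 # skip comment/header lines
        { count[$12]++ }
        END { for (level in count) printf "%s\t%d\n", level, count[level] }
    ' "$1" | LC_ALL=C sort
}

# Usage: count_by_level assembly_summary.txt
```

Since the summary file is regenerated as assemblies are added or revised, running this on different days may give different totals.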

I've found that the GOLD database is a good place to check on the status of a specific genome sequencing project.

ADD COMMENTlink written 5.3 years ago by Josh Herr5.5k
2
3.6 years ago by
UK, Hinxton, EMBL-EBI
Denise - Open Targets4.4k wrote:

Ensembl Bacteria has >15,000 bacterial genomes that are annotated in the INSDC assembly database as complete. There are gene models too. In the next release of Ensembl Genomes, the number will go up to 20,000.

ADD COMMENTlink written 3.6 years ago by Denise - Open Targets4.4k
2
19 months ago by
ctseto20
ctseto20 wrote:

Slight change to the syntax required for those pulling from bacteria.

From example output, the directory structure :

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000149845.2_SJ5/GCF_000149845.2_SJ5_genomic.fna.gz

An example of column 20 from the bacteria assembly summary:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/183/245/GCA_000183245.1_ASM18324v1/GCA_000183245.1_ASM18324v1_genomic.fna.gz

To account for the change from /all/GCF...../GCF...._genomic.fna.gz to GCA/[...]/[...]/[...]/

The new proposed version of the one-liner to construct the URLs for the genomic.fna files is:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | awk '{FS="\t"} !/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' > genomic_file

Having never really used sed like this before, some headscratching took place before I got things working.
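
A possible simplification, offered as a sketch: because the file name is the last directory name plus _genomic.fna.gz in both examples above, the sed expression does not need to spell out the GCA/###/###/### levels. The version below captures only the last path component and uses sed -E, which both GNU and BSD/macOS sed accept. The name dirs_to_fna_urls is made up, and the pattern assumes the paths have no trailing slash.

```shell
#!/usr/bin/env bash
# Sketch: turn assembly directory URLs (one per line on stdin) into
# URLs of the *_genomic.fna.gz file inside each directory.
dirs_to_fna_urls() {
    # & is the whole matched line, \1 its last path component
    sed -E 's|.*/([^/]+)$|&/\1_genomic.fna.gz|'
}

# Usage (network access required, so shown as a comment):
#   curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' \
#     | awk -F '\t' '!/^#/ {print $20}' | dirs_to_fna_urls > genomic_file
```

Because it does not depend on the number of directory levels, the same expression covers the older flat /all/GCF_... layout too.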

ADD COMMENTlink modified 19 months ago • written 19 months ago by ctseto20
1

Oh my!!! So many thanks! Spent hours trying to figure it out.

ADD REPLYlink written 3 months ago by Leonardo20
0
3.6 years ago by
Darko.K0
Germany
Darko.K0 wrote:

Hi everybody,

I'm looking to download all complete bacterial genomes. There's an option at http://www.ncbi.nlm.nih.gov/genome/browse/ to show only complete prokaryotic genomes (3243), and I'm interested in downloading just these. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ lets you download everything, but that's not what I'm looking for. Does anyone know a way to do that? Thank you all!

ADD COMMENTlink written 3.6 years ago by Darko.K0
0
3.6 years ago by
Darko.K0
Germany
Darko.K0 wrote:

How complete are these >15,000 genomes? And is there a way to download all genomes in FASTA (DNA) format with one click?

ADD COMMENTlink written 3.6 years ago by Darko.K0

They are annotated in the INSDC (e.g. ENA, the European Nucleotide Archive) as containing the full genome representation, with CDS annotations for example. You may want to contact ENA for further details on completeness. To download everything in one go, try wget on ftp://ftp.ensemblgenomes.org/pub/current/bacteria/fasta.

ADD REPLYlink written 3.6 years ago by Denise - Open Targets4.4k