Question

Download all complete genomes from GenBank in full .gbk format

7.4 years ago

mmart12 ▴ 30

Hi all, I would like to know if there's a way to download the complete GenBank file (description + genome sequence) for all the strains of one bacterial species in GenBank at once.

Thank you!

genome sequence • 4.1k views

ADD COMMENT • link updated 7.4 years ago by natasha.sernova ★ 4.0k • written 7.4 years ago by mmart12 ▴ 30

score 1 · Answer 1 · 2016-11-24

7.4 years ago

5heikki 11k

At once, no. Programmatically, yes. See the ftpfaq and pay special attention to "assembly summary" files.

To elaborate a bit on 5heikki's suggestion: the following commands, for example, should download all complete genome assemblies for E. coli (taxid: 562):

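wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
# $7 = species_taxid, $12 = assembly_level, $20 = ftp_path; the file name reuses the last path component, since asm_name ($16) may contain spaces
awk -v taxid=562 -v status="Complete Genome" -F $'\t' '$7==taxid && $12==status {n=split($20,p,"/"); print $20 "/" p[n] "_genomic.gbff.gz"}' assembly_summary.txt | xargs wget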

ADD REPLY • link 7.4 years ago by thackl ★ 3.0k
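
Before downloading anything, it can help to print the URL list and check how many assemblies match. The following is a minimal sketch, not a definitive implementation: the function name list_gbff_urls is hypothetical, the column positions (7, 12, 20) are the ones used in the command above, and the file name is assumed to follow the last component of the ftp_path column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one .gbff.gz URL per matching assembly.
# $1 = path to assembly_summary.txt
# $2 = species taxid (e.g. 562)
# $3 = assembly level (optional, defaults to "Complete Genome")
list_gbff_urls() {
    awk -F '\t' -v taxid="$2" -v level="${3:-Complete Genome}" '
        /^#/ { next }                      # skip the header comment lines
        $7 == taxid && $12 == level && $20 != "na" {
            n = split($20, part, "/")      # last path component names the files
            print $20 "/" part[n] "_genomic.gbff.gz"
        }' "$1"
}

# Example (needs network access for the actual download):
#   list_gbff_urls assembly_summary.txt 562 | wc -l
#   list_gbff_urls assembly_summary.txt 562 | xargs -n 1 wget
```

Keeping the listing step separate from the download makes it easy to review or save the URLs first.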

There is a pretty decent R interface for downloading these data in the ape package. Check out this tutorial

ADD REPLY • link 7.2 years ago by conrad.stack • 0

score 0 · Answer 2 · 2016-11-24

There is a way to do it "manually" - although I wouldn't recommend it if the species has a lot of complete genomes.

Take, for example, Escherichia coli O157:H7, which has NCBI taxonomy ID 83334:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=83334&lvl=3&lin=f&keep=1&srchmode=1&unlock

Click on "Nucleotide" under "Subtree links" in the Entrez records table to view all nucleotide sequences assigned to this taxon and all of its children. (That is what "Subtree links" means; "Direct links" would give you only the sequences assigned to this taxon itself, not those of its children.)

This will take you to the NCBI Nucleotide database, with all the Escherichia coli O157:H7 nucleotide sequences displayed:

https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D

You want only complete genomes, so restrict the search to entries with "complete genome" in the title:

https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D+AND+(complete+genome+%5BTitle%5D)

A quicker way to do all of the above is to find the taxon ID for your species, go to NCBI Nucleotide, and type "txid83334[Organism:exp] AND (complete genome [Title])" into the search box, replacing 83334 with whatever taxon ID you need.

This returns (as of Nov 2016) 41 complete genomes for taxon 83334. To download them, click "Summary" at the top right and select "GenBank (full)"; then click "Send" at the top left, choose "Complete Record" and "File", and select the format you want, e.g. "GenBank (full)" or "XML".

This will give you all complete genomes of a taxon/species - not necessarily one per strain.
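
The same search can also be run without the browser through the NCBI E-utilities. The sketch below assumes curl is installed; the function names are hypothetical, the search term is the URL-encoded one from the link above, and the retmax default of 100 is an arbitrary choice for illustration. Larger result sets would probably need to be fetched in batches with retstart and retmax.

```shell
#!/usr/bin/env bash
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build the URL-encoded search term for a given taxon ID.
nuccore_term() {
    printf 'txid%s%%5BOrganism%%3Aexp%%5D+AND+(complete+genome+%%5BTitle%%5D)' "$1"
}

# Read an esearch XML reply on stdin and print the value of one element.
xml_value() {
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# Hypothetical wrapper: search nuccore, keep the hits on the history server,
# then fetch them as full GenBank records (rettype=gbwithparts).
# $1 = taxon ID, $2 = output file, $3 = maximum number of records (optional)
fetch_complete_genomes() {
    local taxid=$1 out=$2 max=${3:-100} reply webenv key
    reply=$(curl -fsS "$EUTILS/esearch.fcgi?db=nuccore&usehistory=y&term=$(nuccore_term "$taxid")") || return 1
    webenv=$(printf '%s\n' "$reply" | xml_value WebEnv)
    key=$(printf '%s\n' "$reply" | xml_value QueryKey)
    [ -n "$webenv" ] && [ -n "$key" ] || return 1
    curl -fsS "$EUTILS/efetch.fcgi?db=nuccore&query_key=$key&WebEnv=$webenv&rettype=gbwithparts&retmode=text&retmax=$max" -o "$out"
}

# Example (needs network access):
#   fetch_complete_genomes 83334 ecoli_O157H7.gbk
```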

score 0 · Answer 3 · 2016-11-24

See my answer to this post: where can I get environmental bacteria genome in fasta format (as many as possible)?

There is an archived copy of the old NCBI RefSeq bacterial genomes, from which you can download all the .gbk files at once as a single gzipped archive.

After unpacking it, you can select the .gbk files for the strains you need.

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/
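
Once such an archive has been downloaded, the files for one species can be pulled out without unpacking everything. This is a minimal sketch: the function name is hypothetical, the archive is assumed to be a .tar.gz whose member paths contain the organism name, and the archive file name in the example is a placeholder for whatever file you actually downloaded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack only the archive members whose path matches a pattern.
# $1 = .tar.gz archive, $2 = extended regex on member paths, $3 = output directory
extract_matching_gbk() {
    local archive=$1 pattern=$2 outdir=$3 members
    # List the members and keep the matching ones; stop if nothing matches.
    members=$(tar -tzf "$archive" | grep -E "$pattern") || return 1
    mkdir -p "$outdir"
    # Feed the selected member names back to tar through -T (read names from stdin).
    printf '%s\n' "$members" | tar -xzf "$archive" -C "$outdir" -T -
}

# Example (archive name is a placeholder):
#   extract_matching_gbk downloaded_archive.tar.gz 'Escherichia_coli.*\.gbk$' ecoli_gbk
```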