Question: Download all complete genomes from GenBank in full .gbk format
mmart12 wrote:

Hi all, I would like to know whether there is a way to download the complete GenBank file (description + genome sequence) for all the strains of one bacterial species in GenBank at once.

Thank you!

sequence genome • 1.2k views
modified 2.3 years ago by natasha.sernova • written 2.3 years ago by mmart12
5heikki wrote:

At once, no. Programmatically, yes. See the ftpfaq and pay special attention to "assembly summary" files.

written 2.3 years ago by 5heikki

To elaborate a bit on 5heikki's suggestion: the following commands, for example, should download all complete genome assemblies for E. coli (taxid: 562):

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
awk -v taxid=562 -v status="Complete Genome" -F $'\t' '$7==taxid && $12==status {print $20 "/" $1 "_" $16 "_genomic.gbff.gz"}' assembly_summary.txt | xargs wget
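A slightly more defensive variant of the same idea, as a rough sketch: it takes the file name from the last component of the FTP path (column 20) instead of rebuilding it from the assembly name, which may matter if an assembly name contains spaces, and it skips rows without a path. The helper name `gbff_urls` is made up for illustration, and the column positions (7, 12, 20) are the same ones assumed by the command above.

```shell
# Print one *_genomic.gbff.gz URL per complete-genome assembly of a species.
# Usage: gbff_urls SPECIES_TAXID [SUMMARY_FILE]
# gbff_urls is a hypothetical helper name, not an NCBI tool.
gbff_urls() {
    awk -F '\t' -v taxid="$1" -v level="Complete Genome" '
        /^#/ { next }                        # skip header and comment lines
        $7 == taxid && $12 == level && $20 != "na" {
            n = split($20, part, "/")        # part[n] is the assembly directory name
            print $20 "/" part[n] "_genomic.gbff.gz"
        }' "${2:-assembly_summary.txt}"
}

# Example (needs network access for the download step):
#   gbff_urls 562 assembly_summary.txt | xargs -n 1 wget -c
```

Printing the URLs first, rather than piping straight into wget, makes it easy to check how many assemblies match before downloading anything.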
written 2.3 years ago by thackl

There is a pretty decent R interface for downloading these data in the ape package. Check out this tutorial

modified 2.1 years ago • written 2.1 years ago by conrad.stack
Tonor wrote:

There is a way to do it "manually", although I wouldn't recommend it if the species has a lot of complete genomes.

For example, take Escherichia coli O157:H7, which has the NCBI taxonomy ID 83334:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=83334&lvl=3&lin=f&keep=1&srchmode=1&unlock

In the Entrez records table, click the "Nucleotide" entry under "Subtree links" to view all nucleotide sequences assigned to this taxon and to all of its children. That is what "Subtree links" means; "Direct links" would give you only the sequences assigned to this taxon itself, not those of its children.

This will take you to the NCBI Nucleotide database, with all the Escherichia coli O157:H7 nucleotide sequences displayed:

https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D

You want only complete genomes, so add the requirement that the entry title contains "complete genome" to the search criteria:

https://www.ncbi.nlm.nih.gov/nuccore/?term=txid83334%5BOrganism%3Aexp%5D+AND+(complete+genome+%5BTitle%5D)

A quicker way to do all of the above is to find the taxon ID for your species, go to NCBI Nucleotide, and type "txid83334[Organism:exp] AND (complete genome [Title])" into the search box, replacing 83334 with whatever taxon ID you need.

This will return (as of Nov 2016) 41 complete genomes for taxon 83334. Then, on the top right, click "Summary" and select "GenBank (full)". Next, on the top left, click "Send", choose "Complete Record" and "File", and select the format you want, e.g. "GenBank (full)" or "XML".

This will give you all complete genomes of a taxon/species - not necessarily one per strain.
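The same search can probably be scripted through the NCBI E-utilities instead of clicking through the web page. What follows is only a sketch under a few assumptions: the function names are invented for illustration, the search term is the one quoted above, `rettype=gbwithparts` is taken to be the E-utilities counterpart of "GenBank (full)", and the default `retmax` of 500 is an arbitrary choice.

```shell
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build the Nucleotide search term for a taxonomy ID (same term as typed by hand).
complete_genome_term() {
    printf 'txid%s[Organism:exp] AND (complete genome [Title])' "$1"
}

# Usage: fetch_complete_genomes TAXID [RETMAX] > genomes.gbk
# Needs curl and network access. The search result is kept on the NCBI history
# server (usehistory=y) and then fetched as full GenBank text.
fetch_complete_genomes() {
    search=$(curl -sG "$EUTILS/esearch.fcgi" \
        --data-urlencode "db=nuccore" \
        --data-urlencode "term=$(complete_genome_term "$1")" \
        --data-urlencode "usehistory=y")
    # Pull the history identifiers out of the XML reply.
    webenv=$(printf '%s' "$search" | sed -n 's:.*<WebEnv>\(.*\)</WebEnv>.*:\1:p')
    key=$(printf '%s' "$search" | sed -n 's:.*<QueryKey>\(.*\)</QueryKey>.*:\1:p')
    curl -sG "$EUTILS/efetch.fcgi" \
        --data-urlencode "db=nuccore" \
        --data-urlencode "query_key=$key" \
        --data-urlencode "WebEnv=$webenv" \
        --data-urlencode "rettype=gbwithparts" \
        --data-urlencode "retmode=text" \
        --data-urlencode "retmax=${2:-500}"   # cap on the number of records returned
}

# Example: fetch_complete_genomes 83334 > O157H7_complete.gbk
```

As with the manual route, this returns every record whose title matches, so the result is not necessarily one genome per strain.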

modified 2.3 years ago • written 2.3 years ago by Tonor

Note that all the RefSeq genomes are derived from GenBank genomes, i.e. if you fetch those 41 genomes, you have basically downloaded each genome twice. Also, during submission the submitter gets to decide whether an assembly is complete or not. In practice, assemblies with complete, chromosome, scaffold and contig status don't necessarily differ that much from each other. E.g. O157 assembly sizes are quite similar (contig counts are another thing, though).

[Image of O157 assembly sizes not preserved]

written 2.3 years ago by 5heikki

It won't be every genome twice, only the genomes that also have a RefSeq record. You can easily filter out the RefSeq records based on their accession numbers, which have an underscore ( _ ) between the prefix and the digits: https://support.ncbi.nlm.nih.gov/link/portal/28045/28049/Article/502/What-are-Reference-Sequence-RefSeq-accession-numbers-and-what-information-is-embedded-in-their-format
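For instance, the underscore rule could be applied to a list of accessions like this. It is a minimal sketch that assumes RefSeq prefixes are always two capital letters followed by an underscore (as in NC_ or NZ_); the function names are made up, and the accessions in the example are only illustrations.

```shell
# Succeeds when the accession looks like a RefSeq accession:
# two capital letters, an underscore, then the rest.
is_refseq() {
    case "$1" in
        [A-Z][A-Z]_*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read accessions on stdin (one per line) and keep only the non-RefSeq ones.
drop_refseq() {
    grep -v '^[A-Z][A-Z]_'
}

# Example:
#   printf 'NC_000913.3\nU00096.3\n' | drop_refseq
```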

modified 2.3 years ago • written 2.3 years ago by Tonor
natasha.sernova wrote:

See my answer to this post: where can I get environmental bacteria genome in fasta format (as many as possible)?

There is an old archived copy of the NCBI RefSeq bacterial genomes, from which you can download all the .gbk files at once as a single .gz file.

After unpacking it, you can select the .gbk files for the strains you need.

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/
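Picking out the files for one species could look roughly like this once the archive is downloaded. This is a sketch that assumes the archive is a gzipped tar with one directory per strain, named after the organism; both the helper name `list_gbk_members` and the archive name `all.gbk.tar.gz` in the example are assumptions, so check the directory listing first.

```shell
# List the .gbk members of a .tar.gz archive whose path starts with a prefix.
# Usage: list_gbk_members ARCHIVE PREFIX
list_gbk_members() {
    tar -tzf "$1" | grep "^$2" | grep '\.gbk$'
}

# Example: extract only the matching files into ./ecoli_gbk
# (-T reads member names from a file; supported by GNU tar and bsdtar)
#   mkdir -p ecoli_gbk
#   list_gbk_members all.gbk.tar.gz Escherichia_coli > members.txt
#   tar -xzf all.gbk.tar.gz -C ecoli_gbk -T members.txt
```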

modified 2.3 years ago • written 2.3 years ago by natasha.sernova