Question: Where can I download NCBI and Swiss-Prot/UniProt FTP protein, gene, and genome sequences for bacterial genomes?
samuelksm (Ireland) wrote 5.2 years ago:

I am trying to create a local database of bacterial protein, gene, and genome sequences. These will be kept separate, but I cannot find the bacterial FTP files for the protein, gene, and genome sequences.

Does anyone know the actual link to the download?

Kamil (Baltimore) wrote 5.2 years ago:

Check out the NCBI ftp site here: ftp://ftp.ncbi.nlm.nih.gov/

You can browse around for your specific files of interest.

Beware that there are a lot of bacterial genomes in "genomes/Bacteria" so the page will take a long time to load. You can see a summary of the genomes here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/summary.txt

A 2.7 GB archive containing the FASTA files for all genomes: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz


Extract the FASTA files from the archive:

tar xf all.fna.tar.gz

cd Wolbachia_wRi_uid59371/

head -n3 NC_012416.fna
>gi|225629872|ref|NC_012416.1| Wolbachia sp. wRi, complete genome
TGATCAATTTTAATGTTTTTATACCCTTTACAACCCATCAAAAAATCACCATAATTTTTAGTATGTATTA
AGTAGTATTAGCTTTTCATTTTGCAGTAAGCTATTGATTATCTTATATTTTTCTAATTATTGCTTTTTTC
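
As a quick sanity check after extraction, one could count the FASTA records in each extracted file. The sketch below is only an illustration and not part of the original answer: the function name `count_fasta_records` is hypothetical, and it assumes the archive unpacks into per-organism directories containing `.fna` files, as in the Wolbachia example above.

```shell
#!/bin/sh
# Hypothetical helper: print "<file><TAB><number of FASTA records>" for every
# .fna file found under the given directory (default: current directory).
count_fasta_records() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.fna' | sort | while IFS= read -r f; do
        # Each FASTA record starts with a ">" header line.
        # "|| true" keeps the loop going when a file has no headers at all.
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}
```

For example, `count_fasta_records Wolbachia_wRi_uid59371` would list each `.fna` file in that directory with its record count.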

Thank you. I downloaded the proteins, but on unzipping them I realised they are not in FASTA format. How can I use them to create a BLAST-able database? I was expecting them to be FASTA files.

Reply by samuelksm, written 5.1 years ago

The genomes are in FASTA format. Please see the BLAST manual to learn how to create a database.

Reply by Kamil, written 5.1 years ago
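
For readers who want a starting point for the database step mentioned in the reply above, here is a minimal sketch of building a BLAST database from extracted FASTA files. It is not from the thread itself. It assumes BLAST+ is installed and `makeblastdb` is on the PATH; the function name `build_blast_db`, the output names, and the directory layout are illustrative assumptions. Use `nucl` for nucleotide files such as `.fna`, and `prot` for protein FASTA files (commonly `.faa`).

```shell
#!/bin/sh
# Hypothetical helper: concatenate all FASTA files with a given extension found
# under a directory into one file, then build a BLAST database from it.
# Usage: build_blast_db <dir> <extension: fna|faa> <dbtype: nucl|prot> <db name>
build_blast_db() {
    dir="$1"; ext="$2"; dbtype="$3"; name="$4"
    combined="${name}.${ext}"

    # Gather every matching file into a single multi-FASTA file.
    # The combined file itself is excluded by name in case it lives under <dir>.
    find "$dir" -type f -name "*.${ext}" ! -name "$(basename "$combined")" | sort |
        while IFS= read -r f; do cat "$f"; done > "$combined"

    # Only call makeblastdb when it is actually installed.
    if command -v makeblastdb >/dev/null 2>&1; then
        makeblastdb -in "$combined" -dbtype "$dbtype" -out "$name"
    else
        echo "makeblastdb not found; wrote $combined only" >&2
    fi
}
```

A call such as `build_blast_db . fna nucl bacteria_genomes` would then produce a nucleotide database named `bacteria_genomes`; the database name here is made up for the example.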
Hajk-Georg Drost (Tuebingen) wrote 3.3 years ago:

I know that this question is already 2 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve all bacterial reference genomes and corresponding CDS, proteome, and gff files from several database sources one can simply type:

# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

# download all bacterial reference coding sequences from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "CDS")

# download all bacterial reference proteomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "proteome")

# download all bacterial reference gff files from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "gff")

or

# download all bacterial reference genomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

# download all bacterial reference coding sequences from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "CDS")

# download all bacterial reference proteomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome")

# download all bacterial reference gff files from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "gff")

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Escherichia_coli_genomic_refseq.fna.gz

Organism Name: Escherichia_coli

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Download_Date: Wed Feb 15 15:17:50 2017

refseq_category: reference genome

assembly_accession: GCF_000005845.2

bioproject: PRJNA57779

biosample: SAMN02604091

taxid: 511145

infraspecific_name: strain=K-12 substr. MG1655

version_status: latest

release_type: Major

genome_rep: Full

seq_rel_date: 2013-09-26

submitter: Univ. Wisconsin
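
Because each log entry is a simple `key: value` line, individual fields are easy to pull out with standard tools. The following is a minimal sketch, assuming the log is saved as plain text in the layout shown above; the function name `log_field` is hypothetical and not part of biomartr.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one "key: value" field from a log file.
# Usage: log_field <key> <logfile>
log_field() {
    key="$1"; file="$2"
    # Select the first line that starts with "<key>: " and print everything
    # after that prefix, so values containing colons (URLs, dates) stay intact.
    awk -v k="$key" 'index($0, k ": ") == 1 { print substr($0, length(k) + 3); exit }' "$file"
}
```

For instance, `log_field URL some_log.txt` (with `some_log.txt` standing in for wherever the log was saved) would print the download URL recorded for that genome, which can help when re-fetching exactly the same assembly version.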

I hope this helps.
