Question: How To Download Full Genome Sequence
0
7.9 years ago by
Palu170
Palu170 wrote:

Hi, I want to download all the genes of fully sequenced genomes. Is there an easy way to do that? Thanks, Palu

ADD COMMENTlink modified 2.2 years ago by Hajk-Georg Drost130 • written 7.9 years ago by Palu170
3

Can you clarify your question? First, do you want the full genome sequence, as your title suggests, or the genes, as the text suggests? Second, as you may know, there are now thousands of "fully sequenced genomes", so you may want to narrow it down to a certain subset. (Unless you want a fairly specific subset, the answer to your question as asked is simply: no.)

ADD REPLYlink written 7.9 years ago by brentp23k

Use the internet, silly! :o)

ADD REPLYlink written 7.9 years ago by Martin A Hansen3.0k

Actually, I want to download the genome sequences of those organisms whose genomes are completely sequenced.

ADD REPLYlink written 7.9 years ago by Palu170
6
7.9 years ago by
hadasa1.0k
hadasa1.0k wrote:

Different genomes have been sequenced by different institutes with different motivations and interests. As such, there is no single site where you can find all the genome information that you may want.

That said, NCBI is a good place to start, since it curates the GenBank database, whose contents are mirrored and exchanged with other sequence databases such as EMBL and DDBJ.

Please also have a look at http://www.ncbi.nlm.nih.gov/sites/genome, and at ftp://ftp.ncbi.nlm.nih.gov/genomes/ to download genome data for various organisms.

I would suggest you refine your question to be more specific.

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by hadasa1.0k
1

Well, most downloads occur "one by one". If you want downloads to run unattended, you simply use an FTP site with a command such as "mget", or an rsync server, or write a small shell script.
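The unattended approach can be sketched in a few lines. The example below uses R rather than shell, and it is only a minimal sketch under stated assumptions: each NCBI assembly directory is assumed to contain a genomic FASTA file named after the directory with the suffix `_genomic.fna.gz`, and the input is assumed to be a vector of assembly directory URLs (for instance taken from the `ftp_path` column of NCBI's assembly summary table). The function names `genome_url` and `download_genomes` are hypothetical helpers, not part of any package.

```r
# Build the URL of the genomic FASTA file inside an NCBI assembly directory.
# Assumption: the file is named <directory name>_genomic.fna.gz.
genome_url <- function(ftp_path) {
  ftp_path <- sub("/+$", "", ftp_path)  # drop any trailing slash
  paste0(ftp_path, "/", basename(ftp_path), "_genomic.fna.gz")
}

# Download each genome in turn; a download that raises an error is
# reported and skipped so that the loop can run unattended.
download_genomes <- function(ftp_paths, dest_dir = "genomes") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (url in genome_url(ftp_paths)) {
    dest <- file.path(dest_dir, basename(url))
    tryCatch(
      download.file(url, dest, mode = "wb"),
      error = function(e) message("Failed: ", url)
    )
  }
  invisible(NULL)
}

# Example input: one assembly directory; extend the vector as needed.
ftp_paths <- c(
  "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.35_GRCh38.p9"
)
print(genome_url(ftp_paths))
# download_genomes(ftp_paths)  # uncomment to actually download
```

Separating URL construction from the download loop makes it possible to inspect the list of URLs before any large transfer starts.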

ADD REPLYlink written 7.9 years ago by Neilfws48k

Actually, if I want to download the genome sequences for, say, 200 organisms, it would not be wise to do so one by one. Therefore I am looking for a convenient way to do so.

ADD REPLYlink written 7.9 years ago by Palu170
2
2.2 years ago by
Cambridge
Hajk-Georg Drost130 wrote:

Hi,

I also struggled to find a standardized way to automate genome retrieval for subsequent data analysis or pipelining in genomics studies. So I sat down and wrote an R package named biomartr to fulfill this task. This way, not every study has to use its own home-made shell script to retrieve genomes (which is hard to reproduce if those scripts are not made publicly available).

If you really wish to download all available genes for all sequenced genomes (and here I assume that you mean in the form of coding sequences (CDS) or protein sequences), the biomartr package includes the following functionality:

For example, if you would like to download CDS files and proteome files for all species available in the NCBI RefSeq database, you will find that to date there is data available for almost 8000 fully sequenced species:

biomartr::listKingdoms(db = "refseq")

  Archaea  Bacteria Eukaryota   Viroids   Viruses
       78      1627       425        46      5703

To download the CDS for all ~8000 species, you can type:

# download all CDS stored in RefSeq
biomartr::meta.retrieval.all(db = "refseq", type = "CDS")

To download all protein sequences for all ~8000 species, you can type:

# download all proteomes stored in RefSeq
biomartr::meta.retrieval.all(db = "refseq", type = "proteome")

Alternatively, you can download the entire NCBI RefSeq database by typing:

# download the entire NCBI refseq (protein) database
biomartr::download.database.all(db = "refseq_protein")

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Genomic Sequence Retrieval vignette of the biomartr package. For metagenome downloads, please consult the Meta-Genome Retrieval vignette and for entire database retrieval the Database Retrieval vignette.
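For a chosen list of organisms rather than a whole database, a loop over organism names may be enough. The following is a minimal sketch, not the package author's recommended workflow: it assumes that `biomartr::getGenome()` accepts the arguments `db`, `organism`, and `path` (please check `?biomartr::getGenome` for your installed version), and the wrapper `fetch_genomes` with its `dry_run` switch is a hypothetical helper introduced here for illustration.

```r
# Hypothetical list of organisms; replace it with your own names.
organisms <- c("Homo sapiens", "Mus musculus", "Arabidopsis thaliana")

# Retrieve one genome per organism via biomartr::getGenome().
# With dry_run = TRUE nothing is downloaded and biomartr is not called;
# the function only reports what it would do.
fetch_genomes <- function(organisms, db = "refseq", path = "genomes",
                          dry_run = TRUE) {
  results <- vector("list", length(organisms))
  names(results) <- organisms
  for (org in organisms) {
    if (dry_run) {
      message("Would download: ", org, " from ", db)
      next
    }
    # Assumed signature: getGenome(db, organism, path); see the lead-in.
    results[[org]] <- tryCatch(
      biomartr::getGenome(db = db, organism = org, path = path),
      error = function(e) {
        message("Failed for ", org, ": ", conditionMessage(e))
        NA_character_
      }
    )
  }
  invisible(results)
}

fetch_genomes(organisms)                     # dry run, no network access
# fetch_genomes(organisms, dry_run = FALSE)  # actual download
```

The dry run makes it easy to check the organism list before starting downloads that may take hours.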

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Homo_sapiens_genomic_refseq.fna.gz

Organism Name: Homo_sapiens

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz

Download_Date: Sat Oct 22 12:41:07 2016

refseq_category: reference

genome assembly_accession: GCF_000001405.35

bioproject: PRJNA168

biosample: NA

taxid: 9606

infraspecific_name: NA

version_status: latest

release_type: Patch

genome_rep: Full

seq_rel_date: 2016-09-26

submitter: Genome Reference Consortium
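Because every entry in such a log has the form `key: value`, the log can be read back into R for bookkeeping. The sketch below assumes exactly that layout, as in the example above; `parse_log` is a hypothetical helper and not a biomartr function.

```r
# Parse "key: value" lines into a named character vector.
# Only the first ": " separates key and value, so values that contain
# colons themselves (such as URLs or timestamps) stay intact.
parse_log <- function(lines) {
  lines <- lines[grepl(": ", lines, fixed = TRUE)]  # skip blank lines
  keys <- sub(": .*$", "", lines)
  values <- sub("^[^:]*: ", "", lines)
  setNames(values, keys)
}

# A few lines copied from the example log.
log_lines <- c(
  "File Name: Homo_sapiens_genomic_refseq.fna.gz",
  "",
  "URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz",
  "",
  "Download_Date: Sat Oct 22 12:41:07 2016",
  "",
  "taxid: 9606"
)

info <- parse_log(log_lines)
print(info[["taxid"]])
```

For a log stored on disk, the same function can be applied to the result of `readLines()` on that file.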

I hope that this new functionality provided by biomartr might be useful for your application and for other genomics projects.

ADD COMMENTlink written 2.2 years ago by Hajk-Georg Drost130

I don't think you noticed that this question was asked almost 6 years ago. However, this looks like a great package, so thanks for posting!

ADD REPLYlink written 2.2 years ago by Daniel3.7k
1

Oops, that's my bad :) Many thanks for pointing it out to me. I hope that it is useful anyway for people who have similar questions in the future.

ADD REPLYlink written 2.2 years ago by Hajk-Georg Drost130