I also struggled to find a standardized way to automate genome retrieval for downstream data analysis or for
pipelining in genomics studies, so I wrote an R package named biomartr for this task. This way, not every study has to rely on its own home-made shell script
to retrieve genomes (which is hard to reproduce if those scripts are not made publicly available).
If you really wish to download all available genes for all sequenced genomes (I assume you mean as coding sequences (CDS) or protein sequences), the biomartr package provides functions for exactly this.
For example, if you would like to download CDS files and proteome files for all species available in the NCBI RefSeq database,
you will find that to date there is data available for almost 8000 fully sequenced species:
biomartr::listKingdoms(db = "refseq")
  Archaea  Bacteria Eukaryota   Viroids   Viruses
       78      1627       425        46      5703
To download the CDS files for all ~8000 species, you can type:
# download all CDS stored in RefSeq
biomartr::meta.retrieval.all(db = "refseq", type = "CDS")
To download the protein sequences for all ~8000 species, you can type:
# download all proteomes stored in RefSeq
biomartr::meta.retrieval.all(db = "refseq", type = "proteome")
Alternatively, you can download the entire NCBI RefSeq protein database by typing:
# download the entire NCBI refseq (protein) database
biomartr::download.database.all(db = "refseq_protein")
For more details about downloading specific genomes from specific kingdoms or subkingdoms of life, please consult the Genomic Sequence Retrieval vignette of the biomartr package. For metagenome downloads, please consult the Meta-Genome Retrieval vignette, and for entire database retrieval, the Database Retrieval vignette.
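As a rough sketch of what a kingdom-restricted download might look like: the wrapper name download_kingdom_cds is hypothetical (it is not part of biomartr), and I am assuming that meta.retrieval() accepts kingdom, db and type arguments in your installed version, so please check its help page first. The accepted kingdom names may also differ between releases.

```r
# Hypothetical wrapper, not part of biomartr.
# Assumption: biomartr::meta.retrieval() takes `kingdom`, `db` and `type`.
# The biomartr call is only evaluated when the wrapper is actually called.
download_kingdom_cds <- function(kingdom, db = "refseq") {
  biomartr::meta.retrieval(kingdom = kingdom, db = db, type = "CDS")
}

# Calling the wrapper starts a large download, so it is left commented out.
# Use a kingdom name that your biomartr version reports as available:
# download_kingdom_cds(kingdom = "Archaea")
```

Wrapping the call keeps the database and sequence type in one place, which makes it easier to rerun the same retrieval later.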
Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each
downloaded genome, proteome, or CDS file.
An example log file looks as follows:
File Name: Homo_sapiens_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
Download_Date: Sat Oct 22 12:41:07 2016
genome assembly_accession: GCF_000001405.35
submitter: Genome Reference Consortium
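If you want to record this provenance information in your own analysis, a log in the "key: value" layout shown above could be read back into R roughly as follows. This is only a sketch: parse_log is a hypothetical helper that is not part of biomartr, and it assumes every entry sits on one line with the key and value separated by the first ": ".

```r
# Hypothetical helper (not part of biomartr): turn "key: value" lines
# into a named list, splitting each line at its first ": " only.
parse_log <- function(lines) {
  pos  <- regexpr(": ", lines, fixed = TRUE)  # position of the first ": "
  keep <- pos > 0                             # skip lines without a separator
  keys <- substr(lines[keep], 1, pos[keep] - 1)
  vals <- substr(lines[keep], pos[keep] + 2, nchar(lines[keep]))
  stats::setNames(as.list(vals), keys)
}

# The example log from above, typed in as a character vector
log_lines <- c(
  "File Name: Homo_sapiens_genomic_refseq.fna.gz",
  "Organism Name: Homo_sapiens",
  "Database: NCBI refseq",
  "Download_Date: Sat Oct 22 12:41:07 2016",
  "genome assembly_accession: GCF_000001405.35",
  "submitter: Genome Reference Consortium"
)

log_info <- parse_log(log_lines)
log_info[["genome assembly_accession"]]
```

For a real log you would pass the result of readLines() on the log file to parse_log() instead of the hand-typed vector.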
I hope that this new functionality provided by biomartr might be useful for your application and for other genomics projects.