How do I download the NCBI prokaryote GenBank and RefSeq databases as a single flat text file?
2
0
2.3 years ago
Michael • 0

It's easy to download all viral Genbank and RefSeq genomes from NCBI as a single flat text file of nucleotide FASTAs.

However, how do I do this for all prokaryote Genbank and RefSeq genomes?

If I go to the following URL and click "Download Assemblies": https://www.ncbi.nlm.nih.gov/assembly/?term=prokaryota%5Borgn%5D

...then what I get is a single .tar archive, itself containing several hundred thousand .tar archives - each of those containing the text file with the FASTA nucleotide sequence. It would require 2-3 days for my modest but capable Mac Core Duo to untar all these archives and I expect a further day or two for it to cat them into a single flat text file.

So, how can I download the entire NCBI prokaryote GenBank and RefSeq databases as nucleotide FASTA in a single flat text file (or a manageable number of text files, e.g. 10 files)?

2

To my knowledge, there isn't an equivalent direct programmatic way to do this. It is possible to download all the FASTA files directly as text files, though (see for instance ncbi-genome-download). I would be surprised if cat-ing the files took that long, so you should be able to concatenate them all.

If you can't even concatenate a file of that size, you're going to struggle to do any meaningful downstream analysis with something that unwieldy too, so you may need to reconsider your approach.
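As a rough illustration of the concatenation step, here is a minimal Python sketch that streams every gzipped FASTA under a directory into one plain-text file. The function name `concat_gz_fasta` and the `*.fna.gz` pattern are assumptions made for this example, and it assumes each input file ends with a newline so that headers do not run into the previous sequence.

```python
import gzip
import shutil
from pathlib import Path


def concat_gz_fasta(src_dir, out_path, pattern="*.fna.gz"):
    """Append every gzipped FASTA found under src_dir to one plain-text file.

    Returns the number of input files that were appended.
    """
    count = 0
    with open(out_path, "wb") as out:
        # Sort so the output order is reproducible between runs.
        for path in sorted(Path(src_dir).rglob(pattern)):
            with gzip.open(path, "rb") as handle:
                # Copy in fixed-size chunks so memory use stays flat
                # regardless of how large each genome file is.
                shutil.copyfileobj(handle, out, length=1024 * 1024)
            count += 1
    return count
```

Because the data is copied in chunks, the running time should be dominated by disk and decompression speed rather than by the number of files.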

0

Thanks Joe. By ncbi-genome-download, do you mean a third party shell script / an Entrez query / a web portal?

Yes, I would certainly give cat-ing a go, if I absolutely had to.

Once I've got the single text file, even if it is hundreds of GB in size, I've found it's perfectly possible to run practical analysis on it with packages like HMMER.

1

ncbi-genome-download from Kai Blin is a utility program that will download genomes for you. You can also use NCBI's newest program called Datasets. More here.

0

This doesn't address the question. These utilities download data to separate compressed archives. I am specifically looking for a single flat file download for multiple genomes - as is currently available in the NCBI virus portal - but for prokaryotes.

0

There is no pre-created file for prokaryotic genomes. You will need to make it yourself by downloading the genomes.

You could try to create a FASTA file from ref_prok_rep_genomes, a pre-formatted BLAST database that NCBI makes available on its BLAST db FTP site, by running the blastdbcmd tool on the data files. As the name says, this FASTA would contain only representative genomes. Perhaps that would work for whatever you are trying to do.
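A minimal sketch of that blastdbcmd step, wrapped in Python so the command is explicit. It assumes BLAST+ is installed and that the ref_prok_rep_genomes volumes have already been downloaded and extracted into the working directory; the helper names `blastdbcmd_args` and `dump_blast_db` are made up for this example.

```python
import subprocess


def blastdbcmd_args(db_name, out_fasta):
    """Build the blastdbcmd call that dumps every database entry as FASTA."""
    return [
        "blastdbcmd",
        "-db", db_name,      # database name without the volume/extension suffix
        "-entry", "all",     # export every sequence in the database
        "-outfmt", "%f",     # %f asks for FASTA-formatted output
        "-out", out_fasta,
    ]


def dump_blast_db(db_name, out_fasta):
    """Run blastdbcmd; raises CalledProcessError if the tool reports failure."""
    subprocess.run(blastdbcmd_args(db_name, out_fasta), check=True)
```

Calling `dump_blast_db("ref_prok_rep_genomes", "ref_prok_rep_genomes.fna")` from the directory that holds the database files should then write a single FASTA file.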

0

Thank you, but as I mentioned in my question I am looking for the entire prokaryote RefSeq and Genbank databases.

0

There is no such file. You will have to do it the same way everyone else does and download the genomes separately. You can parse the FTP addresses from the assembly summary files. All RefSeq bacteria come to 600-700 GB and all GenBank bacteria to over 1 TB. Generally you would want files that fit in the RAM of your computer. That being said, you can zcat the archives, no?
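To make the "parse the FTP addresses" step concrete, here is a small Python sketch. It assumes the usual layout of NCBI's `assembly_summary.txt` (tab-separated, column names on the last `#` line, with `ftp_path` and `assembly_level` columns) and the common convention that the genomic FASTA sits at `<ftp_path>/<directory name>_genomic.fna.gz`. The function name `genomic_fasta_urls` is invented for this example.

```python
def genomic_fasta_urls(lines, levels=None):
    """Yield one genomic FASTA URL per usable row of an assembly summary.

    lines  -- iterable of text lines from assembly_summary.txt
    levels -- optional set of assembly_level values to keep
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The last '#' line before the data rows holds the column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        ftp_path = row.get("ftp_path", "na")
        if ftp_path == "na":
            continue  # some assemblies have no download directory
        if levels and row.get("assembly_level") not in levels:
            continue
        basename = ftp_path.rstrip("/").rsplit("/", 1)[-1]
        yield f"{ftp_path}/{basename}_genomic.fna.gz"
```

The resulting list of URLs can be handed to any downloader, and passing something like `levels={"Complete Genome"}` restricts it to finished assemblies.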

0
2.3 years ago
Michael • 0

For anyone else looking to do this, the best solution (involving a manageable number of files) I have found so far is:

## Creating a Bacterial Genbank nucleotide flat file:

• Download the bacterial division files from ftp://ftp.ncbi.nlm.nih.gov/genbank/
• Unzip and convert from annotated Genbank (.gbk) format to .fna format using any of a range of tools.
• Concatenate.
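The three steps above can be sketched in Python using only the standard library. This is a simplified reader under stated assumptions, not a full GenBank parser: it assumes the standard flat-file layout (`LOCUS`, `VERSION`, `ORIGIN`, `//`), names each FASTA record by its accession.version, and silently skips records that carry no `ORIGIN` sequence. The function names are made up for this example; an established tool is likely the safer choice for real work.

```python
import gzip


def genbank_to_fasta(in_handle, out_handle, width=70):
    """Write one FASTA record per GenBank record that has sequence data.

    Returns the number of FASTA records written.
    """
    name, in_origin, chunks, written = None, False, [], 0
    for line in in_handle:
        if line.startswith("LOCUS"):
            name, in_origin, chunks = line.split()[1], False, []
        elif line.startswith("VERSION") and name is not None:
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]  # prefer accession.version over the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit it only if sequence lines were collected.
            if name is not None and chunks:
                seq = "".join(chunks)
                out_handle.write(f">{name}\n")
                for i in range(0, len(seq), width):
                    out_handle.write(seq[i:i + width] + "\n")
                written += 1
            name, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines look like "   61 acgtac gtacgt"; drop the position.
            chunks.extend(line.split()[1:])
    return written


def convert_files(gz_paths, out_path):
    """Unzip, convert and concatenate several .seq.gz files in one pass."""
    total = 0
    with open(out_path, "w") as out:
        for path in gz_paths:
            with gzip.open(path, "rt") as handle:
                total += genbank_to_fasta(handle, out)
    return total
```

Since every input file is converted straight into the same output handle, the concatenation comes for free and no intermediate .fna files are written.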

It is surprising to me that this is not as straightforward for prokaryotes as it could be. As I mentioned in my original question, on the NCBI Virus web interface, whole-database nucleotide FASTAs (RefSeq and Genbank) can be downloaded as a single nucleotide text file with a single click. I have queried this with NCBI and will update this answer if they can add anything to this.

1

> Virus web interface, whole-database nucleotide FASTAs (RefSeq and Genbank) can be downloaded as a single nucleotide text file with a single click.

There are 9507 RefSeq viral entries as complete genomes. In terms of nucleotides, that is at least 2-3 orders of magnitude less than a corresponding compendium for bacteria. Frankly, I would be surprised if the file you are referring to contains all RefSeq viral genomes, but I am definitely not surprised that a single file does not exist for all bacterial RefSeq genomes.

0
2.3 years ago
Mensur Dlakic ★ 20k

To the best of my knowledge, what you want to do is not possible. I don't know the exact reason, but I would guess it is because there isn't enough demand among users to download a single flat file with all RefSeq genomes. Most people like to customize their downloads, and most people have no problem (g)unzipping and concatenating thousands of files.

The recipe you show above for flat .fna files is most likely incorrect, possibly because you are not pointing at the correct directory. It is easy to show using genome_updater that there are 17439 RefSeq bacterial genomes that fulfill the "Complete Genome" criterion (as of June 8th). Likewise, there are 357 RefSeq archaeal genomes fulfilling the same criterion (as of right now). If you do the same exercise but extend it to RefSeq genomes that are not complete (considered in "Contig" state with a reasonably small number of contigs), there are an additional 97100 genomes among bacteria (as of June 8th), and another 459 among archaea (as of right now). These numbers are considerably different from what you have.
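For anyone who wants a rough cross-check of counts like these without genome_updater, a tally over the `assembly_level` column of an assembly summary file is one option. This is a different technique from the one used for the figures above: it assumes the usual tab-separated `assembly_summary.txt` layout, applies no filter on the number of contigs, and will return whatever the file contains on the day it is downloaded. The function name `count_assembly_levels` is invented for this example.

```python
from collections import Counter


def count_assembly_levels(lines):
    """Tally assembly_summary.txt rows by their assembly_level value."""
    header = None
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The last '#' line before the data rows holds the column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        counts[row.get("assembly_level", "unknown")] += 1
    return counts
```

Running it over the bacterial and archaeal summary files separately gives per-level totals that can be compared against the numbers quoted here.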