I sent an email to NCBI, and I think their response solves my problem nicely. Since it may help others with the same confusion, I am posting it below:
is a legacy (the old genome site) directory that will eventually be retired (archived and no longer updated).
is a new directory from NCBI FTP genome site reorganization. Here is the information about the new site:
To access current and actively updated genome assembly data, use the following three directories on the NCBI Genomes FTP site: genbank, refseq, and all.
genbank is a directory of primary genome assembly data and contains assembled genome sequences and associated annotations (if available) that sequencing centers or individual investigators submitted to GenBank or to another member of the International Nucleotide Sequence Database Collaboration (INSDC). You should use this directory if you are interested in obtaining all submitted genome assemblies and your main focus is not accessing genome annotation. The directory is organized by taxonomic groups and you will be able to browse it directly.
refseq is a directory of NCBI-derived genome assembly data containing assembled genomes that NCBI RefSeq staff selected from the primary INSDC data. You should use the refseq directory if you are interested in annotation data that are of high quality and regularly maintained. The sequences of a RefSeq genomic assembly are a copy of those present in the corresponding INSDC assembly. In some cases the copy may not be completely identical as the RefSeq staff may (1) remove smaller pieces (known as contigs) of a sequence or reported contaminants or (2) add non-nuclear genome sequences (for example, mitochondrion) to the assembly. To find primary GenBank (INSDC) assemblies used to create the RefSeq assemblies, use the assembly reports files. All RefSeq genome assemblies have annotations that RefSeq staff either propagated from the primary records or provided through NCBI prokaryotic or eukaryotic genome annotation pipelines. The number of genomic assemblies present in the refseq directory is smaller than that in the genbank directory. The directory is organized by taxonomic groups and you will be able to browse it directly.
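For the step of finding the primary GenBank assembly behind a RefSeq assembly, here is a minimal Python sketch that reads the header of an assembly report. It assumes the report starts with comment lines of the form `# Key: value`, including a `GenBank assembly accession` entry. The sample text, the placeholder accessions, and the function name `parse_report_header` are made up for illustration, so check them against a real report file before relying on this.

```python
def parse_report_header(text):
    """Collect '# Key: value' header lines from an assembly report into a dict."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            continue  # data rows and blank lines are not part of the header
        body = line.lstrip("#").strip()
        if ":" in body:
            key, value = body.split(":", 1)
            header[key.strip()] = value.strip()
    return header


# Hypothetical report excerpt; the accessions are placeholders, not real records.
SAMPLE_REPORT = """\
# Assembly name:  EXAMPLE_v1
# Organism name:  Example organism
# GenBank assembly accession: GCA_000000000.1
# RefSeq assembly accession: GCF_000000000.1
#
# Sequence-Name\tSequence-Role
chr1\tassembled-molecule
"""

header = parse_report_header(SAMPLE_REPORT)
print(header["RefSeq assembly accession"], "was built from",
      header["GenBank assembly accession"])
```

The column-header line and the data rows are skipped because they either contain no colon or do not start with `#`.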
all is a directory that combines the contents of the genbank and refseq directories. Each individual assembly data file is contained in an individual sub-directory. The all directory holds many thousands of sub-directories and you should only access it as a path to a known assembly. Many of the sub-directories are for old versions of assemblies; these are archival and the RefSeq staff will not update them with new data or data in new file formats.
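Since the all directory is meant to be reached through the path of a known assembly, here is a rough Python sketch that builds such a path from an accession and an assembly name. It assumes the nested layout I have seen on the site, where the nine accession digits are split into three-digit sub-directories. That layout is not described in the email above and may change, and the function name `all_directory_path` is my own, so treat this as an illustration only.

```python
import re


def all_directory_path(accession, assembly_name):
    """Build a path under genomes/all for a known assembly.

    Assumed layout (not confirmed by the email above):
    genomes/all/<GCA|GCF>/<3 digits>/<3 digits>/<3 digits>/<accession>_<name>
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits, _version = match.groups()
    # Split the nine digits into three groups of three.
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join(["genomes", "all", prefix, *parts,
                     f"{accession}_{assembly_name}"])


path = all_directory_path("GCF_000001405.40", "GRCh38.p14")
print(path)
```

The accession must include its version number, because each assembly version gets its own sub-directory.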
All other directories on the NCBI Genomes FTP site are legacy directories and we will be sequentially archiving them. If you are using any of these directories, pay attention to their update dates to assure that you are obtaining current data. If you find a directory missing, check if it has already been moved into the archive directory, which you will also find on the Genomes FTP site. Read more about the FTP genomes site structure and learn details on the site reorganization, content, file formats, downloading instructions, and future plans.
4.2 years ago by
Tao • 380