Question

How to choose NCBI viral database?

9.1 years ago

Tao ▴ 540

Hi Guys,

I have noticed that there are two folders on the NCBI FTP server which contain viral genomes:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/

I don’t understand why NCBI puts viral genomes in two locations. I also found that the first folder contains fewer genomes than the second; for example, Human herpesvirus 1 and 2 cannot be found in the first folder but can be found in the second. So what is the difference between the viral genomes in the two folders, and what is the purpose of keeping them in two locations? Which one should we choose if we need to build a BLAST database of all viral genomes?

Thanks, Tao

ncbi refseq genome viral • 5.9k views

9.1 years ago

fanli.gcb ▴ 730

This README and this FAQ about the recent reorganization may be helpful. In short, use the refseq/viral/ folder if you want RefSeq viral genomes. A useful quote:

Historically, the genomes FTP site has been populated by different process flows and NCBI working groups leading to undesirable differences in available content and file formats. Also, data for GenBank genomes and RefSeq genomes were located in different areas of the NCBI FTP site that had different organization.

NCBI has redesigned the genomes FTP site to expand the content and facilitate data access through an organized predictable directory hierarchy with consistent file names and formats. The updated site provides greater support for downloading assembled genome sequences and/or corresponding annotation data. The new FTP site structure provides a single entry point to access content representing either GenBank or RefSeq data.

9.1 years ago

natasha.sernova ★ 4.0k

I would try this URL:

http://www.ncbi.nlm.nih.gov/genome/viruses/

and, on that page, click VIRAL GENOME BROWSER.

That's a good idea, but I'm still confused about why NCBI did it that way.

Reply • 9.1 years ago by Tao ▴ 540

There are 5559 complete genomes in this database. There should be a README file somewhere.

http://www.ncbi.nlm.nih.gov/genome/viruses/about/

That page explains what can be found there and where exactly. There are multiple ways to access the data, and it is not obvious which one covers all viruses. I would ask the people responsible for the database; the recent reorganization is not really clear.

Reply • 9.1 years ago by natasha.sernova ★ 4.0k

Thanks. I also sent an email to NCBI, and their reply is very clear; please see the answer I added.

Reply • 9.1 years ago by Tao ▴ 540

score 4 · Accepted Answer · 2016-05-19

I sent an email to NCBI, and I think their response solves my problem very nicely. Since it may benefit others with similar confusion, I am posting it below:

Dear Colleague,

The following: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/

is a legacy (the old genome site) directory that will eventually be retired (archived and no longer updated).

This one:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/

is a new directory from NCBI FTP genome site reorganization. Here is the information about the new site:

To access current and actively updated genome assembly data, use the following three directories on the NCBI Genomes FTP site: genbank, refseq, and all.

genbank is a directory of primary genome assembly data and contains assembled genome sequences and associated annotations (if available) that sequencing centers or individual investigators submitted to GenBank or to another member of the International Nucleotide Sequence Database Collaboration (INSDC). You should use this directory if you are interested in obtaining all submitted genome assemblies and your main focus is not accessing genome annotation. The directory is organized by taxonomic groups and you will be able to browse it directly.

refseq is a directory of NCBI-derived genome assembly data containing assembled genomes that NCBI RefSeq staff selected from the primary INSDC data. You should use the refseq directory if you are interested in annotation data that are of high quality and regularly maintained. The sequences of a RefSeq genomic assembly are a copy of those present in the corresponding INSDC assembly. In some cases the copy may not be completely identical as the RefSeq staff may (1) remove smaller pieces (known as contigs) of a sequence or reported contaminants or (2) add non-nuclear genome sequences (for example, mitochondrion) to the assembly. To find primary GenBank (INSDC) assemblies used to create the RefSeq assemblies, use the assembly reports files. All RefSeq genome assemblies have annotations that RefSeq staff either propagated from the primary records or provided through NCBI prokaryotic or eukaryotic genome annotation pipelines. The number of genomic assemblies present in the refseq directory is smaller than that in the genbank directory. The directory is organized by taxonomic groups and you will be able to browse it directly.

all is a directory that combines the contents of the genbank and refseq directories. Each individual assembly data file is contained in an individual sub-directory. The all directory holds many thousands of sub-directories and you should only access it as a path to a known assembly. Many of the sub-directories are for old versions of assemblies; these are archival and the RefSeq staff will not update them with new data or data in new file formats.
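
As a rough illustration of what "a path to a known assembly" can look like, the sketch below builds the sub-directory URL for one assembly under the all directory. It assumes the layout seen on the reorganized site (accession prefix, then the accession digits split into groups of three, then a folder named accession_assemblyname); check it against the site's README before relying on it. The function name assembly_dir and the example accession and assembly name are hypothetical.

```python
def assembly_dir(accession, asm_name,
                 root="https://ftp.ncbi.nlm.nih.gov/genomes/all"):
    """Build the sub-directory URL for one assembly (assumed layout)."""
    prefix, rest = accession.split("_", 1)      # e.g. "GCF", "000000001.1"
    digits = rest.split(".", 1)[0]              # drop the version suffix
    # Split the digits into groups of three: "000000001" -> 000/000/001
    groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    # Assumption: spaces in assembly names appear as underscores in paths
    folder = f"{accession}_{asm_name.replace(' ', '_')}"
    return "/".join([root, prefix, *groups, folder])


# Hypothetical accession and assembly name, for illustration only
print(assembly_dir("GCF_000000001.1", "ExAsm"))
```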

All other directories on the NCBI Genomes FTP site are legacy directories and we will be sequentially archiving them. If you are using any of these directories, pay attention to their update dates to assure that you are obtaining current data. If you find a directory missing, check if it has already been moved into the archive directory, which you will also find on the Genomes FTP site. Read more about the FTP genomes site structure and learn details on the site reorganization, content, file formats, downloading instructions, and future plans.

Best regards
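
For the original goal of building a BLAST database from refseq/viral/, here is a minimal sketch of one possible first step: collecting the genomic FASTA URLs for every assembly in that directory. It assumes the directory provides a tab-separated assembly_summary.txt whose column names are on the last line starting with "#", that one of those columns is ftp_path, and that each assembly folder holds a file named after the folder plus _genomic.fna.gz. The function name genomic_fasta_urls and the sample rows are hypothetical.

```python
def genomic_fasta_urls(summary_text):
    """Return one *_genomic.fna.gz URL per assembly in an assembly summary."""
    header = None
    urls = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The last comment line is assumed to carry the column names
            header = line.lstrip("#").strip().split("\t")
            continue
        if not line.strip() or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        path = row.get("ftp_path", "")
        if path in ("", "na"):
            continue  # no download location recorded for this assembly
        folder = path.rstrip("/").rsplit("/", 1)[-1]
        urls.append(f"{path.rstrip('/')}/{folder}_genomic.fna.gz")
    return urls


# Tiny made-up summary, for illustration only
sample = (
    "#   See README for column descriptions\n"
    "#assembly_accession\torganism_name\tftp_path\n"
    "GCF_000000001.1\tExample virus\thttps://example.org/all/GCF/000/000/001/GCF_000000001.1_ExAsm\n"
    "GCF_000000002.1\tOther virus\tna\n"
)
for url in genomic_fasta_urls(sample):
    print(url)
```

Once the listed files are downloaded and decompressed into a single FASTA file (called viral.fna here, a hypothetical name), a nucleotide database could then be built with something like `makeblastdb -in viral.fna -dbtype nucl -out viral_refseq`.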