Question

Meaning of NCBI RefSeq file names


7.8 years ago

andrea.rubiop88 ▴ 20

Hello everyone,

I need to download the RefSeq files for viral genomes from the NCBI database. I found the FTP download link (ftp://ftp.ncbi.nih.gov/refseq/release/viral/), which contains the files listed below. I've tried to find out what each file is, but I can't find the meaning of the numbers anywhere. What is the difference between viral.1.1.genomic.fna.gz and viral.2.1.genomic.fna.gz? They all seem to have been uploaded on the same date, so they can't be different versions. I checked the README (ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/README), but I don't see the information I need there.

List of files:

viral.1.1.genomic.fna.gz
viral.1.genomic.gbff.gz
viral.1.protein.faa.gz
viral.1.protein.gpff.gz
viral.2.1.genomic.fna.gz
viral.2.genomic.gbff.gz
viral.2.protein.faa.gz
viral.2.protein.gpff.gz
viral.nonredundant_protein.1.protein.faa.gz
viral.nonredundant_protein.1.protein.gpff.gz

Can anyone tell me what the difference is or where to find this information? Thanks a lot!

ncbi refseq genome sequence • 3.9k views


score 2 · Answer 1 · 2016-06-27


7.8 years ago

GenoMax 141k

1.1 and 2.1 are just pieces of the same data, split into two files (the vertebrate RefSeq directories have hundreds of such pieces, e.g. ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/ ).

From the release notes (look for the current file in this directory):

complete.10.1.bna.gz
|--------|--|-|---|--|
   1      2  3  4   5

   1. directory location 
   2. numerical increment 
       -to provide a set of unique file names
   3. optional: sub-part number 
       -to provide a unique file name for genomic FASTA files which may be split based on size
   4. format type 
   5. compression
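
As a rough illustration of the naming scheme above, here is a minimal Python sketch that splits a file name into its parts and groups the pieces that belong together. The regular expression is inferred from the names listed in this thread, not taken from NCBI documentation. In particular, the "molecule" token (genomic/protein) does not appear in the release-notes diagram and is an assumption, as are the function names parse_refseq_name and group_pieces.

```python
import re
from collections import defaultdict

# Pattern inferred from the file names in this thread (an assumption, not an
# official NCBI specification). The optional "molecule" token (genomic/protein)
# is not part of the release-notes diagram.
NAME_RE = re.compile(
    r"(?P<directory>.+?)"               # 1. directory location, e.g. "viral"
    r"\.(?P<increment>\d+)"             # 2. numerical increment
    r"(?:\.(?P<subpart>\d+))?"          # 3. optional sub-part number
    r"(?:\.(?P<molecule>[A-Za-z_]+))?"  # assumed: genomic / protein
    r"\.(?P<format>[A-Za-z]+)"          # 4. format type, e.g. fna, gbff, faa
    r"\.(?P<compression>gz)"            # 5. compression
)


def parse_refseq_name(name):
    """Split one release file name into its named components."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    fields = match.groupdict()
    fields["increment"] = int(fields["increment"])
    if fields["subpart"] is not None:
        fields["subpart"] = int(fields["subpart"])
    return fields


def group_pieces(names):
    """Group file names that are pieces of the same data set, in numeric order."""
    groups = defaultdict(list)
    for name in names:
        f = parse_refseq_name(name)
        key = (f["directory"], f["molecule"], f["format"])
        groups[key].append((f["increment"], f["subpart"] or 0, name))
    return {key: [n for _, _, n in sorted(pieces)] for key, pieces in groups.items()}
```

Called on the list of files from the question, group_pieces puts viral.1.1.genomic.fna.gz and viral.2.1.genomic.fna.gz under the same key, which matches the point that they are two pieces of one data set rather than two versions.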



Thank you very much for your answer, I had the same question.

So you explained the difference between viral.1.protein.faa and viral.2.protein.faa: they are the viral protein FASTA database divided into two files (the same goes for the DNA and GenBank files).

What is the difference between them and the file viral.nonredundant_protein.1.protein.faa?

Isn't RefSeq already a non-redundant database?

7.7 years ago by ac.research ▴ 30


From the release notes:

1.3.3 Biologically non-redundant data set

RefSeq provides a biologically non-redundant set of sequences for database searching and gene characterization. It has the advantage of providing an objective and experimentally verifiable definition of "non-redundant" in supplying one example of each natural biomolecule per organism or sample. The small amount of sequence redundancy introduced from close paralogs, alternate splicing products, and genome assembly intermediates is compensated for by the clarity of the model. RefSeq provides the substrate for a variety of conclusions about non-redundancy based on clustering identical sequences, or families of related sequences, without confounding the database itself with these more subjective assessments.
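
To make "clustering identical sequences" concrete, here is a small Python sketch that collapses FASTA records with exactly the same sequence into one cluster. It is only an illustration of the idea. It is not NCBI's procedure, and it is not a claim about how the nonredundant_protein files are built. The function names read_fasta and cluster_identical are hypothetical.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cluster_identical(records):
    """Map each distinct sequence to the list of record IDs that carry it."""
    clusters = defaultdict(list)
    for header, seq in records:
        # Use the first whitespace-delimited token of the header as the ID.
        record_id = (header.split() or [""])[0]
        clusters[seq.upper()].append(record_id)
    return dict(clusters)
```

For a decompressed protein file, something like cluster_identical(read_fasta(open(path))) would show which accessions share an identical sequence; each cluster with more than one ID is redundant at the sequence level even though every record is a distinct biological entry.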

7.7 years ago by GenoMax 141k