Meaning of NCBI RefSeq file names
7.8 years ago

Hello everyone,

I need to download the RefSeq files for viral genomes from the NCBI database. I found the FTP download directory (ftp://ftp.ncbi.nih.gov/refseq/release/viral/), which contains the files listed below. I've tried to work out what each file is, but I can't find an explanation of the numbers anywhere. What is the difference between viral.1.1.genomic.fna.gz and viral.2.1.genomic.fna.gz? They all appear to have been uploaded on the same date, so they can't be different versions. I checked the README (ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/README), but it doesn't contain the information I need.

List of files:

  • viral.1.1.genomic.fna.gz
  • viral.1.genomic.gbff.gz
  • viral.1.protein.faa.gz
  • viral.1.protein.gpff.gz
  • viral.2.1.genomic.fna.gz
  • viral.2.genomic.gbff.gz
  • viral.2.protein.faa.gz
  • viral.2.protein.gpff.gz
  • viral.nonredundant_protein.1.protein.faa.gz
  • viral.nonredundant_protein.1.protein.gpff.gz

Can anyone tell me what the difference is or where to find this information? Thanks a lot!

7.8 years ago
GenoMax 141k

1.1 and 2.1 are just pieces of the same data set, split across two files (vertebrate RefSeq has hundreds of such pieces: ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/ ).
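If you want a single file per data type, the pieces can simply be joined back together. Here is a minimal sketch, assuming the pieces are already downloaded into one local directory; `concatenate_pieces` and the output name are hypothetical, not anything NCBI provides. It relies on the fact that gzip files concatenated byte for byte still form a valid multi-member gzip stream.

```python
import shutil
from pathlib import Path


def piece_order(path):
    """Sort key: the numeric fields of the file name, compared as integers."""
    return tuple(int(field) for field in Path(path).name.split(".") if field.isdigit())


def concatenate_pieces(pieces, combined):
    """Append each gzip piece to `combined` in numeric order.

    No decompression is needed: gzip readers treat concatenated
    members as one continuous stream.
    """
    with open(combined, "wb") as out:
        for piece in sorted(pieces, key=piece_order):
            with open(piece, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # Assumes the viral genomic FASTA pieces sit in the current directory.
    found = list(Path(".").glob("viral.*.genomic.fna.gz"))
    if found:
        concatenate_pieces(found, "viral_all.genomic.fna.gz")  # hypothetical output name
```

Sorting by the integer value of the numeric fields keeps piece 10 after piece 2, which matters for directories with many pieces.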

From the release notes (look for current file in this directory):

complete.10.1.bna.gz
|--------|--|-|---|--|
   1      2  3  4   5

   1. directory location 
   2. numerical increment 
       -to provide a set of unique file names
   3. optional: sub-part number 
       -to provide a unique file name for genomic FASTA files which may be split based on size
   4. format type 
   5. compression
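
To make the scheme concrete, here is a rough sketch that splits a file name into those five parts. The regular expression is my own reading of the naming scheme above, not something taken from the release notes, and `RefSeqName` and `parse_refseq_name` are hypothetical names. I am also assuming that the molecule label (`genomic`, `protein`) belongs with the format type, since the release-notes example has no such label.

```python
import re
from typing import NamedTuple, Optional


class RefSeqName(NamedTuple):
    directory: str           # 1. directory location, e.g. "viral"
    increment: int           # 2. numerical increment
    sub_part: Optional[int]  # 3. optional sub-part number (split genomic FASTA)
    format_type: str         # 4. format type, e.g. "genomic.fna"
    compression: str         # 5. compression, e.g. "gz"


# Assumed pattern: directory, number, optional number, format, "gz".
_NAME = re.compile(
    r"(?P<directory>.+?)"
    r"\.(?P<increment>\d+)"
    r"(?:\.(?P<sub_part>\d+))?"
    r"\.(?P<format_type>[A-Za-z_]+(?:\.[A-Za-z_]+)*)"
    r"\.(?P<compression>gz)"
)


def parse_refseq_name(name):
    """Split a RefSeq release file name into its parts, or raise ValueError."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name}")
    sub_part = match["sub_part"]
    return RefSeqName(
        directory=match["directory"],
        increment=int(match["increment"]),
        sub_part=int(sub_part) if sub_part is not None else None,
        format_type=match["format_type"],
        compression=match["compression"],
    )


if __name__ == "__main__":
    for example in ("complete.10.1.bna.gz",
                    "viral.1.1.genomic.fna.gz",
                    "viral.2.protein.faa.gz"):
        print(example, parse_refseq_name(example))
```

With this reading, viral.1.1.genomic.fna.gz and viral.2.1.genomic.fna.gz differ only in the numerical increment, so they are two pieces of the same genomic FASTA set.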

Thank you very much for your answer, I had the same question.

So you explained the difference between the files viral.1.protein.faa and viral.2.protein.faa: they are the viral protein FASTA database divided into two files, and the same goes for the DNA and GenBank files.

What is the difference between them and the file viral.nonredundant_protein.1.protein.faa?

Isn't RefSeq already a non-redundant database?


From the release notes:

1.3.3 Biologically non-redundant data set

RefSeq provides a biologically non-redundant set of sequences for database searching and gene characterization. It has the advantage of providing an objective and experimentally verifiable definition of "non-redundant" in supplying one example of each natural biomolecule per organism or sample. The small amount of sequence redundancy introduced from close paralogs, alternate splicing products, and genome assembly intermediates is compensated for by the clarity of the model. RefSeq provides the substrate for a variety of conclusions about non-redundancy based on clustering identical sequences, or families of related sequences, without confounding the database itself with these more subjective assessments.
