Question: Meaning of NCBI RefSeq file names
2.9 years ago, andrea.rubiop88 wrote:

Hello everyone,

I need to download the RefSeq files for viral genomes from the NCBI database. I found the FTP download link (ftp://ftp.ncbi.nih.gov/refseq/release/viral/), which contains the files listed below. I've tried to find out what each file is, but I can't find the meaning of the numbers anywhere. What is the difference between viral.1.1.genomic.fna.gz and viral.2.1.genomic.fna.gz? They all seem to have been uploaded on the same date, so they can't be different versions. I checked the README (ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/README), but I don't see the information I need.

List of files:

  • viral.1.1.genomic.fna.gz
  • viral.1.genomic.gbff.gz
  • viral.1.protein.faa.gz
  • viral.1.protein.gpff.gz
  • viral.2.1.genomic.fna.gz
  • viral.2.genomic.gbff.gz
  • viral.2.protein.faa.gz
  • viral.2.protein.gpff.gz
  • viral.nonredundant_protein.1.protein.faa.gz
  • viral.nonredundant_protein.1.protein.gpff.gz

Can anyone tell me what the difference is or where to find this information? Thanks a lot!

2.9 years ago, genomax wrote:

In viral.1.1.genomic.fna.gz and viral.2.1.genomic.fna.gz, the numbers just mark pieces of the same data set split across two files (the vertebrate RefSeq directories have hundreds of such pieces: ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian/ ).
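
Because the numbered files are pieces of one data set, they can be recombined after download. The following is a minimal sketch, assuming the pieces are already on local disk; the function names `concatenate_pieces` and `read_text` are hypothetical helpers, not part of any NCBI tool. It relies on the gzip format allowing several compressed members to be appended back to back, which Python's `gzip` module reads as a single stream.

```python
import gzip
import shutil


def concatenate_pieces(pieces, combined):
    """Append each gzip piece to `combined`, byte for byte, in the order given."""
    with open(combined, "wb") as out:
        for piece in pieces:
            with open(piece, "rb") as src:
                shutil.copyfileobj(src, out)


def read_text(path):
    """Return the decompressed text of a (possibly multi-member) gzip file."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

For example, `concatenate_pieces(["viral.1.protein.faa.gz", "viral.2.protein.faa.gz"], "viral.protein.faa.gz")` would produce one compressed FASTA file containing the records of both pieces.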

From the release notes (look for current file in this directory):

complete.10.1.bna.gz
|--------|--|-|---|--|
   1      2  3  4   5

   1. directory location 
   2. numerical increment 
       -to provide a set of unique file names
   3. optional: sub-part number 
       -to provide a unique file name for genomic FASTA files which may be split based on size
   4. format type 
   5. compression
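
The naming scheme above can be sketched as a small parser. This is only an illustration, not an official NCBI utility: the function name `parse_release_name` is hypothetical, and two fields are assumptions that do not appear in the quoted diagram but are inferred from the file list in the question, namely an optional subset label (as in viral.nonredundant_protein.1.protein.faa.gz) and an optional molecule type (genomic, protein, rna).

```python
import re

RELEASE_NAME = re.compile(
    r"""
    ^(?P<directory>[a-z_]+)                    # 1. directory location, e.g. viral
    (?:\.(?P<subset>[a-z_]+))?                 # assumption: optional subset label
    \.(?P<increment>\d+)                       # 2. numerical increment
    (?:\.(?P<subpart>\d+))?                    # 3. optional sub-part number
    (?:\.(?P<molecule>genomic|protein|rna))?   # assumption: optional molecule type
    \.(?P<format>[a-z]+)                       # 4. format type
    \.(?P<compression>gz)$                     # 5. compression
    """,
    re.VERBOSE,
)


def parse_release_name(name):
    """Split a RefSeq release file name into its labelled fields."""
    match = RELEASE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name}")
    fields = match.groupdict()
    # Convert the numeric fields so that pieces sort in numeric order.
    fields["increment"] = int(fields["increment"])
    if fields["subpart"] is not None:
        fields["subpart"] = int(fields["subpart"])
    return fields
```

Calling `parse_release_name("viral.2.1.genomic.fna.gz")` gives increment 2 and sub-part 1, which makes it easy to sort the pieces of one data set before processing them in order.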
modified 2.8 years ago • written 2.9 years ago by genomax

Thank you very much for your answer, I had the same question.

So you explained the difference between viral.1.protein.faa and viral.2.protein.faa: they are the viral protein FASTA database divided into two files, and the same goes for the DNA and GenBank files.

How do they differ from the file viral.nonredundant_protein.1.protein.faa?

Isn't RefSeq already a non-redundant database?

written 2.8 years ago by ac.research

From the release notes:

1.3.3 Biologically non-redundant data set

RefSeq provides a biologically non-redundant set of sequences for database searching and gene characterization. It has the advantage of providing an objective and experimentally verifiable definition of "non-redundant" in supplying one example of each natural biomolecule per organism or sample. The small amount of sequence redundancy introduced from close paralogs, alternate splicing products, and genome assembly intermediates is compensated for by the clarity of the model. RefSeq provides the substrate for a variety of conclusions about non-redundancy based on clustering identical sequences, or families of related sequences, without confounding the database itself with these more subjective assessments.

modified 2.8 years ago • written 2.8 years ago by genomax