Downloading the RefSeq proteins complete data set
2.8 years ago
Dunois ★ 2.5k

The RefSeq "complete" dataset is available for download via FTP here. I am interested in the protein sequences therein (the *.protein.faa.gz files).

There seem to be two "sets" of files for the "complete" division:

complete.[0-9]+.protein.faa.gz
complete.nonredundant_protein.[0-9]+.protein.faa.gz

According to the relevant NCBI documentation these non-redundant data sets cover bacterial and archaeal sequences.

My question is: to get the complete "complete" dataset, do I need to download all the complete.nonredundant_protein.[0-9]+.protein.faa.gz files alongside all the complete.[0-9]+.protein.faa.gz files, or would this be double-counting? Or do the complete.[0-9]+.protein.faa.gz files on their own cover all protein sequence data available at NCBI?
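To make the two "sets" concrete, here is a minimal sketch that sorts a directory listing into the two file-name forms quoted above. The regular expressions are transcribed from those patterns; the function name `classify` and the sample file names are made up for illustration, and how the listing is obtained is left open.

```python
import re

# Regexes transcribed from the two file-name forms quoted in the question.
PLAIN = re.compile(r"^complete\.[0-9]+\.protein\.faa\.gz$")
NONREDUNDANT = re.compile(
    r"^complete\.nonredundant_protein\.[0-9]+\.protein\.faa\.gz$"
)


def classify(filenames):
    """Group file names into 'plain', 'nonredundant', and 'other'.

    `classify` is a hypothetical helper, not part of any NCBI tool.
    """
    groups = {"plain": [], "nonredundant": [], "other": []}
    for name in filenames:
        if PLAIN.match(name):
            groups["plain"].append(name)
        elif NONREDUNDANT.match(name):
            groups["nonredundant"].append(name)
        else:
            groups["other"].append(name)
    return groups


if __name__ == "__main__":
    # Illustrative names only; the real listing comes from the FTP directory.
    listing = [
        "complete.1.protein.faa.gz",
        "complete.nonredundant_protein.1.protein.faa.gz",
        "complete.1.genomic.fna.gz",
    ]
    for group, names in classify(listing).items():
        print(group, names)
```

Because the plain pattern requires digits directly after `complete.`, it cannot match a nonredundant file name, so the two groups never overlap at the file-name level.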

ftp refseq • 1.1k views
2.8 years ago
GenoMax 141k

You will need to download all of the protein files to get the "complete" dataset. Looking at the summary stats, there is only one "Protein" line for that directory. You can always email the NCBI help desk to confirm.

Directory: complete

    Number of taxids: 111743

    Number of Accessions and total length per molecule type:

    Genomic:    36631677    2303089292160
    RNA:        38417656    100386516185
    Protein:    204185448   79078139531
    Wgs master: 191069  0
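
If double-counting is still a concern after downloading both sets, one way to check empirically is to count how often each accession appears across the files. The sketch below is only that, a sketch: it assumes the files are gzipped FASTA with the accession as the first whitespace-delimited token of each header line, and the helper names `accession_counts` and `duplicated` are invented here. For the full dataset the counter will hold a very large number of keys, so it is better tried on a subset of files first.

```python
import gzip
from collections import Counter


def accession_counts(paths):
    """Count occurrences of each FASTA accession across gzipped files.

    Assumes the accession is the first token after '>' on header lines.
    """
    counts = Counter()
    for path in paths:
        with gzip.open(path, "rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split(None, 1)
                    if fields:  # skip headers with no identifier
                        counts[fields[0]] += 1
    return counts


def duplicated(paths):
    """Return a sorted list of accessions seen more than once."""
    counts = accession_counts(paths)
    return sorted(acc for acc, n in counts.items() if n > 1)
```

An empty list from `duplicated(...)` over files from both sets would suggest the sets do not overlap, at least for the files checked.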