Download all RefSeq proteins from all organisms in one faa-file
9.2 years ago
seth97 ▴ 10

How can I download all RefSeq proteins from all organisms in one faa-file?

I'm looking at the NCBI RefSeq FTP site: ftp://ftp.ncbi.nlm.nih.gov/refseq/

For example, I can get the rat RefSeq proteins: ftp://ftp.ncbi.nlm.nih.gov/refseq/R_norvegicus/mRNA_Prot/rat.1.protein.faa.gz

But how can I get all organisms in one file?

Sorry if this is an obvious question.

Thanks!

genome • 13k views

Thanks for that!

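To combine the files, I ran these commands in the Terminal (macOS):

# Decompress every .gz file under the current directory, keeping the originals (-k)
find . -name '*.gz' -exec gunzip -k {} \;
# Concatenate the .faa files in the current directory (subdirectories are not included)
cat *.faa > ~/output.txt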
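
A possible alternative, as a sketch only: stream-decompress the downloaded archives straight into one file, so no intermediate .faa copies are left on disk and files in subdirectories are picked up too. The function name combine_faa and the output path in the usage line are made up for illustration; it assumes find and gzip are available.

```shell
# combine_faa DIR OUT: hypothetical helper (the name is illustrative, not from the thread).
# Decompresses every *.faa.gz found under DIR and writes the concatenated FASTA to OUT.
combine_faa() {
    # gzip -dc writes decompressed data to stdout and leaves the .gz files untouched.
    # Files are concatenated in whatever order find reports them.
    find "$1" -type f -name '*.faa.gz' -exec gzip -dc {} + > "$2"
}
```

Usage would be something like: combine_faa . ~/all_refseq_protein.faa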

If you are not restricted to RefSeq, you could download such a single faa file directly from UniProt: http://www.uniprot.org/downloads

9.2 years ago

Use wget to recursively download everything under ftp://ftp.ncbi.nlm.nih.gov/refseq/release/ (see http://serverfault.com/questions/25199), using the option --accept=LIST to keep only the *.faa.gz files, and then concatenate the FASTA files.
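
The recursive download described above might look roughly like this. It is a sketch, not something checked against the current server layout: the function name fetch_refseq_faa and the target directory in the usage line are made up, and it assumes a wget build with FTP support and a server that still answers on ftp://.

```shell
# fetch_refseq_faa DEST: hypothetical wrapper (the name is illustrative).
# Recursively fetches only *.faa.gz files below the RefSeq release directory into DEST.
REFSEQ_RELEASE_URL='ftp://ftp.ncbi.nlm.nih.gov/refseq/release/'

fetch_refseq_faa() {
    # -r   recurse into subdirectories
    # -np  never ascend above the starting directory
    # -nd  do not recreate the remote directory tree locally
    # -A   accept list: keep only files matching the pattern (short form of --accept)
    # -P   directory to save the files into
    wget -r -np -nd -A '*.faa.gz' -P "$1" "$REFSEQ_RELEASE_URL"
}
```

Usage would be something like: fetch_refseq_faa ./refseq_faa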


Wouldn't this suffice?

wget 'ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/*.faa.gz'

From: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release69.txt

The data that comprises a RefSeq release are available in several file formats, as indicated by the format component in the file name:
bna    binary ASN.1 format; includes nucleotide and protein
gbff   GenBank flat file format; nucleotide records
gpff   GenPept flat file format; protein records
fna    FASTA format; nucleotide records
faa    FASTA format; protein records

The comprehensive full release is deposited in the "complete" directory and is available in all file types.


Most probably
