Question: Downloading protein sequences for a set of chromosomes from NCBI
4.1 years ago by
bioinfo740 wrote:

Can anyone give me some idea on how to download all the protein sequences for a set of chromosomes from NCBI?

I have a list of chromosomal RefSeq IDs (e.g. NC_015600, NC_014498, NC_012468, ...) and I want one FASTA file per chromosome containing all of its proteins (e.g. NC_015600.faa, NC_014498.faa, NC_012468.faa, etc.) from NCBI. Any ideas?

genbank efetch ncbi • 1.7k views
4.1 years ago by
Houston, TX
RamRS24k wrote:

From a cursory glance, the GenBank record for each chromosome has protein_id qualifiers whose accession numbers can be used to get the proteins in FASTA format.

For example, the first protein_id in NC_015600's GenBank record is WP_013851383.1, which can be retrieved using the URL

You'd have to iterate through all available protein_id of each chromosome.

Is there an exception or a special case that prevents this solution from being useful?
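
The iteration over protein_id entries described above could be sketched roughly as follows. This is a minimal sketch, assuming the GenBank flat file for a chromosome has already been saved as text and that each protein accession appears as a /protein_id="..." qualifier. The function name extract_protein_ids and the second accession in the sample are hypothetical, used only for illustration.

```python
import re

# In a GenBank flat file, each CDS feature carries a qualifier such as
#   /protein_id="WP_013851383.1"
PROTEIN_ID_RE = re.compile(r'/protein_id="([^"]+)"')


def extract_protein_ids(genbank_text):
    """Return protein_id accessions in order of appearance, without duplicates."""
    seen = set()
    ids = []
    for accession in PROTEIN_ID_RE.findall(genbank_text):
        if accession not in seen:
            seen.add(accession)
            ids.append(accession)
    return ids


# Tiny made-up excerpt; only the first accession comes from the post,
# the second one is a placeholder.
sample = '''     CDS             1..1365
                     /protein_id="WP_013851383.1"
     CDS             1400..2500
                     /protein_id="WP_000000000.1"
'''

print(extract_protein_ids(sample))
```

Each accession returned this way would then be fetched individually, which is why this approach needs one request per protein.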


Hi, I think you are right, Ram. The only way to do this may be to iterate over each protein and find which proteins belong to the chromosome of interest. But I think this kind of thing can be reported to NCBI. You can ask them whether there is a way to find all protein_id values for a given chromosome. This could be a new way to connect data, and could be useful! :)

Written 4.1 years ago by glihm600

I was wondering whether we could do it as shown below, where $1 is the text file with the chromosome IDs. On the FTP site, under each bacterium, there is a file called NC_XXXXX.faa that contains all proteins for a chromosome. The problem is that the wildcard with wget or curl didn't work here. Is there any way to make it work?

usage: bash chr.ids

chr.ids looks like this:

cat $1
while read line;
do curl -r -l1 --no-parents*/"$line.faa" > $line.faa;
done < $1


Modified 4.1 years ago • written 4.1 years ago by bioinfo740

I've seen problems with wild cards and curl/wget, but I haven't seen any solution yet. Maybe something here might help:

Written 4.1 years ago by RamRS24k
4.1 years ago by
glihm600 wrote:

Here is a solution using the Python ftputil package (pip install ftputil):



import os

import ftputil

# Local download root, assumed here to be the current directory.
# (On Windows, use a path on the right drive, e.g. under C:.)
base_path = os.getcwd()

# Remote directory to extract data from
remote_root = '/genomes/Bacteria/'

# NCBI FTP server (the host name is left blank here and must be filled in)
host = ftputil.FTPHost('', 'anonymous', 'password')
host.chdir(remote_root)
# listdir returns the names of the sub-directories (one per genome)
dir_list = host.listdir(host.curdir)
for dir_name in dir_list:
    # Use absolute remote paths so the checks do not depend on the current directory
    remote_dir = remote_root + dir_name
    if host.path.isdir(remote_dir):
        # List the files of this genome
        file_list = host.listdir(remote_dir)
        # Make a local directory for each genome
        local_dir = os.path.join(base_path, dir_name)
        if not os.path.isdir(local_dir):
            os.makedirs(local_dir)
        for file_name in file_list:
            # Keep only the .faa files; change the extension if needed
            if file_name.endswith('.faa'):
                remote_file = remote_dir + '/' + file_name
                if host.path.isfile(remote_file):
                    target = os.path.join(local_dir, file_name)
                    print("Downloading file " + target)
                    host.download(remote_file, target)
host.close()


I commented the code, but the script will:

i) Connect to the NCBI FTP server as anonymous.
ii) Enter the directory you need (/genomes/Bacteria/).
iii) Go through each genome, creating a local directory and downloading only the ".faa" files.
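
The script above downloads every ".faa" file it finds. To restrict it to the chromosome list from the question, the file_list of each genome could be filtered first. The following is only a sketch under that assumption: the helper names read_ids and wanted_faa_files are hypothetical, and it relies on the files being named NC_XXXXX.faa as described earlier in the thread.

```python
def read_ids(text):
    """Split a comma- or whitespace-separated list of chromosome IDs."""
    return text.replace(",", " ").split()


def wanted_faa_files(file_names, chrom_ids):
    """Keep only the files named <chromosome id>.faa for the requested IDs."""
    wanted = {chrom_id + ".faa" for chrom_id in chrom_ids}
    return [name for name in file_names if name in wanted]


# Example with a made-up directory listing
ids = read_ids("NC_015600,NC_014498\nNC_012468")
listing = ["NC_015600.faa", "NC_015600.gbk", "NC_999999.faa"]
print(wanted_faa_files(listing, ids))
```

Inside the download loop, iterating over wanted_faa_files(file_list, ids) instead of file_list would skip every file that is not on the list.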

Hope it helps! ;)

4.1 years ago by
United States
Siva1.6k wrote:

You can use 'efetch' and set the 'rettype' option to 'fasta_cds_aa'.

For chromosome id NC_015600:

Information about valid 'rettype' and 'retmode' values for efetch can be found here

EDIT: If you want to use the command-line E-utilities,

efetch -db nuccore -id NC_015600 -format fasta_cds_aa -mode text > sequence.txt
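
To get one file per chromosome, as asked in the question, the same request could be repeated for each ID. Below is a minimal Python sketch of that idea. The endpoint URL is not given in this thread; it is assumed here to be the standard E-utilities efetch address, and the function names, the out_dir parameter, and the half-second pause between requests are illustrative choices rather than anything prescribed by NCBI in this post.

```python
import os
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed E-utilities endpoint (not stated in the thread)
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(chrom_id):
    """Build an efetch URL asking for the translated CDS of one nuccore record."""
    params = {
        "db": "nuccore",
        "id": chrom_id,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


def download_proteins(chrom_ids, out_dir=".", delay=0.5):
    """Write <chrom_id>.faa into out_dir for every chromosome ID."""
    for chrom_id in chrom_ids:
        with urlopen(build_efetch_url(chrom_id)) as response:
            data = response.read()
        with open(os.path.join(out_dir, chrom_id + ".faa"), "wb") as handle:
            handle.write(data)
        # Pause between requests to stay polite to the server
        time.sleep(delay)
```

Calling download_proteins(["NC_015600", "NC_014498", "NC_012468"]) would then write NC_015600.faa, NC_014498.faa and NC_012468.faa to the current directory.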
