Question

Downloading protein sequences for a set of chromosomes from NCBI

10.2 years ago

bioinfo ▴ 840

Can anyone suggest how to download all the protein sequences for a set of chromosomes from NCBI?

I have a list of chromosomal RefSeq IDs (e.g. NC_015600, NC_014498, NC_012468, ...) and I want to get an individual FASTA file of all proteins for each chromosome (e.g. NC_015600.faa, NC_014498.faa, NC_012468.faa) from NCBI. Any ideas?

ncbi genbank efetch • 4.7k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by bioinfo ▴ 840

Ram · Answer 1 · 2015-08-30

10.2 years ago

Ram 45k

At a cursory glance, the GenBank record for each chromosome has protein_id qualifiers whose accession numbers can be used to retrieve the proteins in FASTA format.

For example, the first protein_id in NC_015600's GenBank record is WP_013851383.1, which can be retrieved using the URL http://www.ncbi.nlm.nih.gov/protein/WP_013851383.1?report=fasta&format=text

You'd have to iterate through all the protein_id values in each chromosome's record.

Is there an exception or a special case that prevents this solution from being useful?
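As a rough illustration of that iteration, here is a minimal Python sketch. It assumes the GenBank flat file for a chromosome has already been saved as text, and it pulls out the protein_id values with a regular expression before building one URL per protein in the pattern shown above. The function names and the sample record are made up for illustration; the sample is hand-written, not a real NCBI record, and "XX_000000.1" is a placeholder accession.

```python
import re


def extract_protein_ids(genbank_text):
    """Return protein_id values in order of appearance, without duplicates."""
    ids = re.findall(r'/protein_id="([^"]+)"', genbank_text)
    return list(dict.fromkeys(ids))


def protein_fasta_url(protein_id):
    """Build the FASTA URL for one protein, following the URL pattern above."""
    return ("http://www.ncbi.nlm.nih.gov/protein/"
            + protein_id + "?report=fasta&format=text")


# Hand-written stand-in for part of a GenBank record (not real data).
sample = '''
     CDS             <hypothetical location>
                     /protein_id="WP_013851383.1"
     CDS             <hypothetical location>
                     /protein_id="XX_000000.1"
     CDS             <hypothetical location>
                     /protein_id="WP_013851383.1"
'''

for pid in extract_protein_ids(sample):
    print(protein_fasta_url(pid))
```

Fetching one URL per protein means many requests per chromosome, so this is only practical for small sets.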

ADD COMMENT • link 3.1 years ago by Ram 45k

Hi, I think you are right, Ram. The only way to do this may be to iterate over every protein and find which ones belong to the chromosome of interest. But I think this is worth raising with NCBI: you could ask them whether there is a way to find all the protein_id values for a given chromosome. It would be a new way to connect the data, and it could be useful! :)

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by glihm ▴ 660

I was wondering whether we could do it the way shown below, where $1 is the text file with the chromosome IDs. On the FTP site, under each bacterium's directory, there is a file called NC_XXXXX.faa that contains all the proteins for a chromosome. The problem is that the wildcard did not work with wget or curl here. Is there any way to make it work?

usage: bash script.sh chr.ids

chr.ids looks like this:
NC_014225
NC_008800
NC_015224
NC_017564

script.sh:

cat $1
while read line;
do curl -r -l1 --no-parents ftp://ftp.ncbi.nih.gov/genomes/Bacteria/*/"$line.faa" > $line.faa;
done < $1

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 10.2 years ago by bioinfo ▴ 840

I've seen problems with wildcards in curl/wget, but I haven't seen a solution yet. Something here might help: http://stackoverflow.com/questions/18107236/using-wildcards-in-wget-or-curl-query
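One way around the wildcard is to expand it by hand: list the sub-directories, then look inside each one for the file. Below is a minimal sketch of that idea using Python's standard ftplib. The function name find_faa is made up for illustration, and the sketch assumes the server's NLST reply may contain either bare names or full paths, so only the last path component is compared.

```python
import posixpath
from ftplib import FTP, error_perm


def find_faa(ftp, accession, root="/genomes/Bacteria"):
    """Return the remote path of <accession>.faa under root, or None.

    ftp is any object with an nlst(path) method, e.g. a logged-in ftplib.FTP.
    """
    target = accession + ".faa"
    for entry in ftp.nlst(root):
        directory = posixpath.join(root, posixpath.basename(entry))
        try:
            files = ftp.nlst(directory)
        except error_perm:
            # Some servers answer with an error for empty or unreadable entries.
            continue
        if target in (posixpath.basename(f) for f in files):
            return posixpath.join(directory, target)
    return None


# Usage (needs network access; uncomment to try):
# ftp = FTP("ftp.ncbi.nih.gov")
# ftp.login()  # anonymous login
# print(find_faa(ftp, "NC_014225"))
# ftp.quit()
```

Scanning every directory for every ID is slow, so for a long list it would make more sense to walk the tree once and record where each .faa file lives.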

ADD REPLY • link 3.1 years ago by Ram 45k

Ram · Answer 2 · 2015-08-30

Here is a solution using the Python ftputil package (sudo pip install ftputil):

#!/usr/bin/env python3

import os

import ftputil

# Local download directory. "~" is expanded to your home directory;
# on Windows, replace it with a suitable path (C:\... etc.).
base_path = os.path.expanduser("~/MyGenomes/NCBI")
# Remote directory to extract files from
remote_root = '/genomes/Bacteria/'

# Connect to the NCBI FTP server as anonymous
host = ftputil.FTPHost('ftp.ncbi.nlm.nih.gov', 'anonymous', 'password')
host.chdir(remote_root)
# List the names of the sub-directories (one per genome)
dir_list = host.listdir(host.curdir)
for dir_name in dir_list:
    host.chdir(remote_root)
    if host.path.isdir(dir_name):
        # Enter the directory and get its list of files
        host.chdir(remote_root + dir_name + '/')
        file_list = host.listdir(host.curdir)
        # Make a local directory for each genome (no error if it already exists)
        local_dir = os.path.join(base_path, dir_name)
        os.makedirs(local_dir, exist_ok=True)
        # Download the files you want from the genome's file list
        for file_name in file_list:
            # Keep only .faa files; change the extension to fetch other types
            if file_name.endswith(".faa"):
                print("File " + file_name)
                if host.path.isfile(file_name):
                    target = os.path.join(local_dir, file_name)
                    print("Downloading file " + target)
                    host.download(file_name, target)

I commented the code, but in short the script will:

Connect to the NCBI FTP server as anonymous.
Enter the directory you need (genomes/Bacteria).
Go through each genome, creating a local directory and downloading only the ".faa" files.

Hope it helps! ;)

score 0 · Answer 3 · 2015-09-25

You can use 'efetch' and set the 'rettype' option to 'fasta_cds_aa'

For chromosome id NC_015600:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_015600&rettype=fasta_cds_aa&retmode=text

Information about valid 'rettype' and 'retmode' values for efetch can be found in the NCBI E-utilities documentation.

EDIT: If you want to use the command-line E-utilities:

efetch -db nuccore -id NC_015600 -format fasta_cds_aa -mode text > sequence.txt
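
To apply the URL form of this request to a whole list of chromosome IDs, something like the following Python sketch could work. It builds the same efetch URL as above for each ID and saves the response as <ID>.faa. The function names and the one-second pause between requests are my own choices for illustration, not anything prescribed by NCBI's tools.

```python
import time
import urllib.request
from urllib.parse import urlencode

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build the efetch URL that returns a chromosome's CDS translations."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_all(accessions, pause=1.0):
    """Save one <accession>.faa file per chromosome ID (needs network)."""
    for accession in accessions:
        urllib.request.urlretrieve(efetch_url(accession), accession + ".faa")
        time.sleep(pause)  # assumed polite delay between requests


# Usage (uncomment to download):
# download_all(["NC_015600", "NC_014498", "NC_012468"])
```

The IDs could equally be read from a file such as chr.ids, one per line.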