Downloading protein sequences for a set of chromosomes from NCBI
8.6 years ago
bioinfo ▴ 830

Can anyone give me some idea on how to download all the protein sequences for a set of chromosomes from NCBI?

I have a list of chromosomal RefSeq IDs (e.g. NC_015600, NC_014498, NC_012468, ...) and I want to get a separate FASTA file containing all the proteins of each chromosome (e.g. NC_015600.faa, NC_014498.faa, NC_012468.faa, etc.) from NCBI. Any ideas?

8.6 years ago
Ram 43k

From a cursory glance, the GenBank record for each chromosome contains protein_id qualifiers whose accession numbers can be used to retrieve the proteins in FASTA format.

For example, the first protein_id in NC_015600's GenBank record is WP_013851383.1, which can be retrieved using the URL http://www.ncbi.nlm.nih.gov/protein/WP_013851383.1?report=fasta&format=text

You'd have to iterate over every protein_id in each chromosome's record.
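The iteration described above can be sketched in Python. This is only a rough illustration, assuming the GenBank record has already been saved as flat-file text; the function names are made up for this example, the sample fragment is not a real record, and EXAMPLE_0002.1 is a placeholder accession.

```python
import re

# Matches GenBank feature qualifiers such as: /protein_id="WP_013851383.1"
PROTEIN_ID_RE = re.compile(r'/protein_id="([^"]+)"')


def extract_protein_ids(genbank_text):
    """Return protein_id accessions in order of appearance, without duplicates."""
    seen = set()
    ids = []
    for accession in PROTEIN_ID_RE.findall(genbank_text):
        if accession not in seen:
            seen.add(accession)
            ids.append(accession)
    return ids


def protein_fasta_url(accession):
    """Build the per-protein FASTA URL in the form quoted in the answer."""
    return ("http://www.ncbi.nlm.nih.gov/protein/%s?report=fasta&format=text"
            % accession)


# Illustrative fragment only (coordinates and the second accession are
# placeholders, not taken from a real record).
sample = '''
     CDS             1..1356
                     /protein_id="WP_013851383.1"
     CDS             1500..2600
                     /protein_id="EXAMPLE_0002.1"
'''

for accession in extract_protein_ids(sample):
    print(protein_fasta_url(accession))
```

Each printed URL would then be fetched in turn and the results concatenated into one .faa file per chromosome.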

Is there an exception or a special case that prevents this solution from being useful?


Hi, I think you are right, Ram. The only way to do this may be to iterate over each protein and find which ones belong to the chromosome of interest. But I think this is something that could be suggested to NCBI: you could ask them whether there is a way to find all the protein_id values for a given chromosome. It would be a new way to connect the data, and could be useful! :)


I was wondering whether we could do it the way shown below, where $1 is the text file with the chromosome IDs. On the FTP site, under each bacterium's directory, there is a file called NC_XXXXX.faa that contains all the proteins for a chromosome. The problem is that the wildcard did not work with wget or curl here. Is there any way to make it work?

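Usage: bash script.sh chr.ids

chr.ids looks like this:

NC_014225
NC_008800
NC_015224
NC_017564

script.sh:

cat "$1"
while read -r line;
do
  # Does not work as written: curl does not expand "*" in URLs,
  # and -r, -l1 and --no-parents are wget-style options, not curl ones.
  curl -r -l1 --no-parents "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/*/$line.faa" > "$line.faa";
done < "$1"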

I've seen problems with wildcards in curl/wget before, but I haven't seen a solution yet. Maybe something here will help: http://stackoverflow.com/questions/18107236/using-wildcards-in-wget-or-curl-query

8.6 years ago
glihm ▴ 660

Here is a solution using the Python ftputil package (pip install ftputil):

#!/usr/bin/env python3

import os

import ftputil

# Local download directory; expanduser() resolves "~"
# (on Windows, use a path such as C:\... instead).
base_path = os.path.expanduser("~/MyGenomes/NCBI")
remote_root = '/genomes/Bacteria/'

# Connect to the NCBI FTP server as anonymous
host = ftputil.FTPHost('ftp.ncbi.nlm.nih.gov', 'anonymous', 'password')
# The directory to extract information from
host.chdir(remote_root)
# List the names of the sub-directories
dir_list = host.listdir(host.curdir)
# For each sub-directory
for dir_name in dir_list:
    host.chdir(remote_root)
    if host.path.isdir(dir_name):
        # Enter the directory and get its list of files
        host.chdir(remote_root + dir_name + '/')
        file_list = host.listdir(host.curdir)
        # Make a local directory for each genome (no error if it already exists)
        local_dir = os.path.join(base_path, dir_name)
        os.makedirs(local_dir, exist_ok=True)
        # Download only the files with the extension you want (.faa here)
        for file_name in file_list:
            if file_name.endswith(".faa") and host.path.isfile(file_name):
                target = os.path.join(local_dir, file_name)
                print("Downloading file " + target)
                host.download(file_name, target)

The code is commented, but in short the script will:

  1. Connect to the NCBI FTP server as anonymous.
  2. Enter the directory you need (/genomes/Bacteria/).
  3. Go through each genome, creating a directory and downloading only the ".faa" files.
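Since the question asks for a specific set of chromosomes rather than every genome, the download step could be restricted to the IDs in the list. A minimal sketch, assuming the FTP files are named "<chromosome id>.faa" as described earlier in the thread; select_faa_files and read_ids are names invented for this example, and NC_999999 is a placeholder ID.

```python
import os


def read_ids(path):
    """Read one chromosome ID per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def select_faa_files(file_list, wanted_ids):
    """Keep only '<chromosome id>.faa' names whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    selected = []
    for file_name in file_list:
        stem, ext = os.path.splitext(file_name)
        if ext == ".faa" and stem in wanted:
            selected.append(file_name)
    return selected


# Example with a made-up directory listing (NC_999999 is a placeholder).
listing = ["NC_014225.faa", "NC_014225.gbk", "NC_999999.faa"]
print(select_faa_files(listing, ["NC_014225", "NC_008800"]))
```

In the script, the inner loop would then run over select_faa_files(file_list, read_ids("chr.ids")) instead of over file_list, so only the requested chromosomes are downloaded.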

Hope it helps! ;)

8.6 years ago
Siva ★ 1.9k

You can use 'efetch' and set the 'rettype' option to 'fasta_cds_aa'.

For chromosome id NC_015600:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_015600&rettype=fasta_cds_aa&retmode=text

Information about valid 'rettype' and 'retmode' values for efetch can be found in the NCBI E-utilities documentation.

EDIT: If you want to use the command-line E-utilities (EDirect):

efetch -db nuccore -id NC_015600 -format fasta_cds_aa -mode text > sequence.txt
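To get one .faa file per chromosome from a list of IDs, the same efetch URL can be built in a loop. The following is a sketch rather than a tested pipeline: it assumes the https form of the endpoint shown above, the function names are invented for this example, and the pause between requests is an assumption added as a courtesy to the server.

```python
import os
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Same efetch endpoint as in the answer, written with https (assumption).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(chrom_id):
    """Build the efetch URL returning all CDS translations for one record."""
    params = {
        "db": "nuccore",
        "id": chrom_id,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_proteins(chrom_id, out_dir="."):
    """Download the proteins of one chromosome into <out_dir>/<chrom_id>.faa."""
    out_path = os.path.join(out_dir, chrom_id + ".faa")
    with urlopen(build_efetch_url(chrom_id)) as response:
        with open(out_path, "wb") as out:
            out.write(response.read())
    return out_path


def fetch_all(ids_path, out_dir=".", delay=0.5):
    """Fetch every ID listed (one per line) in ids_path, pausing in between."""
    with open(ids_path) as handle:
        chrom_ids = [line.strip() for line in handle if line.strip()]
    for chrom_id in chrom_ids:
        print("Wrote " + fetch_proteins(chrom_id, out_dir))
        time.sleep(delay)
```

Calling fetch_all("chr.ids") would then write NC_014225.faa, NC_008800.faa and so on into the current directory.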
