Question: Downloading protein sequences for a set of chromosomes from NCBI
3.6 years ago by
bioinfo690
EU
bioinfo690 wrote:

Can anyone give me some idea on how to download all the protein sequences for a set of chromosomes from NCBI?

I have a list of chromosomal RefSeq ids (e.g. NC_015600, NC_014498, NC_012468, ...) and I want to get a separate FASTA file of all proteins for each chromosome (e.g. NC_015600.faa, NC_014498.faa, NC_012468.faa, etc.) from NCBI. Any ideas?

genbank efetch ncbi • 1.6k views
modified 3.5 years ago by Siva1.6k • written 3.6 years ago by bioinfo690
3.6 years ago by
RamRS20k
Houston, TX
RamRS20k wrote:

From a cursory glance, the GenBank records for each chromosome have protein_id records with accession numbers that can be used to get the proteins in FASTA format.

For example, the first protein_id in NC_015600's GenBank record is WP_013851383.1, which can be retrieved using the URL http://www.ncbi.nlm.nih.gov/protein/WP_013851383.1?report=fasta&format=text

You'd have to iterate through all available protein_id of each chromosome.
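
A minimal sketch of that iteration in Python, assuming you have already saved the chromosome's GenBank flat file as text. The function names `extract_protein_ids` and `protein_fasta_url` are made up for illustration, and the URL pattern is simply the one from the example above.

```python
import re

# Matches a GenBank feature qualifier such as /protein_id="WP_013851383.1"
PROTEIN_ID_RE = re.compile(r'/protein_id="([^"]+)"')


def extract_protein_ids(genbank_text):
    """Return protein_id accessions in order of first appearance, without duplicates."""
    seen = set()
    ids = []
    for accession in PROTEIN_ID_RE.findall(genbank_text):
        if accession not in seen:
            seen.add(accession)
            ids.append(accession)
    return ids


def protein_fasta_url(accession):
    """Build the FASTA URL for one protein, following the pattern shown above."""
    return ("http://www.ncbi.nlm.nih.gov/protein/%s?report=fasta&format=text"
            % accession)
```

Each URL returned by `protein_fasta_url` would then be fetched in turn and the results appended to one .faa file per chromosome.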

Is there an exception or a special case that prevents this solution from being useful?

modified 3.6 years ago • written 3.6 years ago by RamRS20k

Hi, I think you are right, Ram. The only way to do this may be to iterate over each protein and find which proteins belong to the chromosome of interest. But I think this kind of thing can be reported to NCBI. You can ask them whether there is a way to find all protein_id values for a given chromosome. This could be a new way to connect data, and could be useful! :)

written 3.6 years ago by glihm590

I was wondering whether we could do it as shown below, where $1 is the text file with the chromosome ids. On the FTP site, under each bacterium's directory, there is a file called NC_XXXXX.faa that contains all proteins for a chromosome. The problem is that the wildcard did not work with wget or curl. Is there any way to make it work?

usage: bash script.sh chr.ids

chr.ids looks like this:
NC_014225
NC_008800
NC_015224
NC_017564

script.sh:

cat $1
while read line;
do curl -r -l1 --no-parents ftp://ftp.ncbi.nih.gov/genomes/Bacteria/*/"$line.faa" > $line.faa;
done < $1

 

modified 3.6 years ago • written 3.6 years ago by bioinfo690

I've seen problems with wild cards and curl/wget, but I haven't seen any solution yet. Maybe something here might help: http://stackoverflow.com/questions/18107236/using-wildcards-in-wget-or-curl-query

written 3.6 years ago by RamRS20k
3.6 years ago by
glihm590
France
glihm590 wrote:

Here is a solution using the Python ftputil package (pip install ftputil):



 

#!/usr/bin/env python3

import os

import ftputil

# Local directory for the downloads. "~" must be expanded explicitly;
# on Windows, replace this with a suitable path (C:\... etc.).
base_path = os.path.expanduser("~/MyGenomes/NCBI")
# Remote directory that holds one sub-directory per genome.
remote_root = '/genomes/Bacteria/'

# Connect to the NCBI FTP server as an anonymous user.
host = ftputil.FTPHost('ftp.ncbi.nlm.nih.gov', 'anonymous', 'password')
host.chdir(remote_root)
# List the names of the sub-directories.
dir_list = host.listdir(host.curdir)

for dir_name in dir_list:
    host.chdir(remote_root)
    if host.path.isdir(dir_name):
        # Enter the genome directory and get its list of files.
        host.chdir(remote_root + dir_name + '/')
        file_list = host.listdir(host.curdir)
        # Make a local directory for each genome (no error if it already exists).
        local_dir = os.path.join(base_path, dir_name)
        os.makedirs(local_dir, exist_ok=True)
        # Download only the files with the wanted extension (.faa here).
        for file_name in file_list:
            if file_name.endswith(".faa") and host.path.isfile(file_name):
                target = os.path.join(local_dir, file_name)
                print("Downloading file " + target)
                host.download(file_name, target)

host.close()

 

I commented the code, but in short the script will:

i) Connect to the NCBI FTP server as anonymous.
ii) Enter the directory you need (genomes/Bacteria).
iii) Go through each genome, creating a local directory and downloading only the ".faa" files.

Hope it helps! ;)

written 3.6 years ago by glihm590
3.5 years ago by
Siva1.6k
United States
Siva1.6k wrote:

You can use 'efetch' and set the 'rettype' option to 'fasta_cds_aa'

For chromosome id NC_015600:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_015600&rettype=fasta_cds_aa&retmode=text

Information about valid 'rettype' and 'retmode' values for efetch can be found in the NCBI E-utilities documentation.

EDIT: If you want to use the command-line E-utilities:

efetch -db nuccore -id NC_015600 -format fasta_cds_aa -mode text > sequence.txt
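
To get one .faa file per chromosome from a list of ids, the same efetch URL can be built in a loop. This is only a sketch: the function names and the half-second pause between requests are my own choices, and the id file is assumed to hold one accession per line, like chr.ids above.

```python
import sys
import time
import urllib.parse
import urllib.request

# Same endpoint as the URL shown above.
EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(chrom_id):
    """Build the efetch URL that returns all CDS translations for one chromosome."""
    params = {
        "db": "nuccore",
        "id": chrom_id,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def read_ids(path):
    """Read one chromosome id per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def download_proteins(chrom_id):
    """Save the protein FASTA for one chromosome as <chrom_id>.faa."""
    out_name = chrom_id + ".faa"
    with urllib.request.urlopen(build_efetch_url(chrom_id)) as response:
        with open(out_name, "wb") as out:
            out.write(response.read())
    return out_name


if __name__ == "__main__" and len(sys.argv) == 2:
    for chrom_id in read_ids(sys.argv[1]):
        print("Wrote " + download_proteins(chrom_id))
        time.sleep(0.5)  # assumed pause, to stay polite to the NCBI servers
```

Usage, assuming the script is saved as fetch_faa.py: python3 fetch_faa.py chr.ids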

modified 3.5 years ago • written 3.5 years ago by Siva1.6k