Can anyone give me some idea on how to download all the protein sequences for a set of chromosomes from NCBI?
I have a list of chromosomal RefSeq IDs (e.g. NC_015600, NC_014498, NC_012468, ...) and I want to get an individual FASTA file of all proteins for each chromosome (e.g. NC_015600.faa, NC_014498.faa, NC_012468.faa, etc.) from NCBI. Any ideas?
Hi, I think you are right, Ram. The only way to do this may be to iterate over every protein and check which ones belong to the chromosome of interest. But I think this kind of thing could be reported to NCBI: you could ask them whether there is a way to find all the protein_ids for a given chromosome. That would be a new way to connect the data, and could be useful! :)
I was wondering whether we could do it the way shown below, where $1 is the text file with the chromosome IDs. On the FTP site, under each bacterium, there is a file called NC_XXXXX.faa that contains all the proteins for a chromosome. The problem is that the wildcard with wget or curl didn't work here. Is there any way we can make it work?
usage:
bash script.sh chr.ids
I've seen problems with wildcards in curl/wget, but I haven't seen a solution yet. Maybe something here might help: http://stackoverflow.com/questions/18107236/using-wildcards-in-wget-or-curl-query
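One way to sidestep the wildcard problem might be to skip the FTP directories and ask E-utilities for the proteins directly. Below is a minimal sketch, not a tested pipeline. It assumes that efetch with db=nuccore and rettype=fasta_cds_aa returns the translated CDS features you are after (the headers will differ from the FTP .faa files), and that chr.ids has one accession per line. The function names efetch_url and fetch_all are made up for this example.

```bash
#!/usr/bin/env bash
# Sketch: fetch the annotated protein (CDS translation) FASTA for each RefSeq
# chromosome accession through NCBI E-utilities instead of FTP wildcards.
# Assumption: rettype=fasta_cds_aa on db=nuccore gives the proteins you want.

EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Build the efetch URL for one accession (e.g. NC_015600).
efetch_url() {
    printf '%s?db=nuccore&id=%s&rettype=fasta_cds_aa&retmode=text\n' "$EUTILS" "$1"
}

# Read accessions from a file, one per line, and save each as <accession>.faa.
fetch_all() {
    local acc
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc="${acc//[[:space:],]/}"      # drop stray whitespace and commas
        [ -z "$acc" ] && continue        # skip blank lines
        curl -sSf -o "${acc}.faa" "$(efetch_url "$acc")" \
            || echo "download failed for ${acc}" >&2
        sleep 0.4                        # stay under NCBI's 3 requests/second limit
    done < "$1"
}

# Only touch the network when an ID file is given: bash script.sh chr.ids
if [ "$#" -eq 1 ]; then
    fetch_all "$1"
fi
```

Run it the same way as before (bash script.sh chr.ids) and it should leave one NC_XXXXX.faa per accession in the current directory. It is worth checking the first file by hand before trusting the rest.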