Can anyone give me some idea on how to download all the protein sequences for a set of chromosomes from NCBI?
I have a list of chromosomal RefSeq IDs (e.g. NC_015600, NC_014498, NC_012468, ...) and I want to get an individual FASTA file of all the proteins in each chromosome (e.g. NC_015600.faa, NC_014498.faa, NC_012468.faa, etc.) from NCBI. Any ideas?
Hi, I think you are right, Ram. The only way to do this may be to iterate over each protein and find which ones belong to the chromosome of interest. But I think this is something that could be reported to NCBI: you can ask them whether there is a way to find all the protein_id values for a given chromosome. It would be a new way to connect the data, and could be useful! :)
I was wondering whether we could do it the way below, where $1 is the text file with the chromosome IDs. On the FTP site, under each bacterium, there is a file called NC_XXXXX.faa that contains all the proteins for a chromosome. The problem is that wildcards with wget or curl didn't work here. Is there any way we can make it work?
usage:
bash script.sh chr.ids
I've seen problems with wild cards and curl/wget, but I haven't seen any solution yet. Maybe something here might help: http://stackoverflow.com/questions/18107236/using-wildcards-in-wget-or-curl-query
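One possible way to sidestep the wildcard problem altogether, sketched here as an alternative rather than a fix for the FTP approach: ask NCBI's E-utilities `efetch` for the CDS translations of each chromosome directly, using `db=nuccore` with `rettype=fasta_cds_aa`. The helper names `efetch_url` and `split_ids` are made up for this sketch, and it assumes the ID file holds accessions separated by commas, spaces, or newlines. Note that the FASTA headers you get back may not match those in the FTP `.faa` files.

```bash
#!/usr/bin/env bash
# Sketch: write one <accession>.faa per chromosome listed in an ID file.
# Assumption: uses NCBI E-utilities efetch, not the FTP site discussed above.

EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Build the efetch URL for one nucleotide accession (hypothetical helper).
# rettype=fasta_cds_aa asks for the protein translation of each CDS feature.
efetch_url() {
    printf '%s?db=nuccore&id=%s&rettype=fasta_cds_aa&retmode=text\n' "$EUTILS" "$1"
}

# Turn comma-, space-, or tab-separated IDs on stdin into one ID per line
# (hypothetical helper).
split_ids() {
    tr ', \t' '\n\n\n' | sed '/^$/d'
}

# Download only when an ID file is given as the first argument.
if [ "$#" -ge 1 ]; then
    split_ids < "$1" | while read -r acc || [ -n "$acc" ]; do
        # -f makes curl fail on HTTP errors instead of saving an error page.
        curl -sS -f -o "${acc}.faa" "$(efetch_url "$acc")" \
            || echo "failed: $acc" >&2
        sleep 1   # pause between requests to stay under NCBI's rate limit
    done
fi
```

Usage would be the same as before: `bash script.sh chr.ids`. Since each chromosome is requested by accession, there is no need to know which organism directory it lives in, which is what the wildcard was trying to work around.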