The Genome Information Broker for Viruses (gibVirus) provides a .fasta file with over 18,000 full-length viral genomes ( http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1781101/ ). It is useful for viral detection and for reporting integration boundaries with a Perl script developed under an old version of the MOSAIK assembler ( http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html ).

I would like to build a similar FASTA file for all the bacteria at NCBI or Ensembl, but the FASTA files all sit in per-organism subfolders alongside .gbk and many other files. I have looked at wget or mget scripts run against the FTP sites, but I don't see how to retrieve just the .fna (or .fa or .fasta) files within those folders, and downloading the whole collection looks like a giant undertaking. Are there ideas or code for doing this that are manageable and economical in terms of disk space?

FYI, Ensembl Bacteria is here: http://bacteria.ensembl.org/info/website/ftp/index.html and the NCBI counterpart is its bacterial genomes site. There is a related question on Biostars: Where Can I Download Nucleotide Sequences Of Bacterial Genes? (I *did* look at NCBI E-utilities, but I haven't a clue how I would use them to do this…)
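One possible way to fetch only the FASTA files, sketched in Python with the standard `ftplib` module. This is a rough illustration, not a tested mirror script: the host and directory in the commented usage are hypothetical placeholders, the "no dot in the name means it is a subfolder" test is a heuristic assumption, and compressed files (e.g. `.fa.gz`) are deliberately ignored.

```python
from ftplib import FTP, error_perm
import posixpath

FASTA_SUFFIXES = (".fna", ".fa", ".fasta")


def is_fasta(name):
    """Return True if the file name ends in one of the FASTA suffixes."""
    return name.lower().endswith(FASTA_SUFFIXES)


def fetch_fasta(ftp, remote_dir, out_handle, depth=2):
    """Walk remote_dir up to `depth` levels deep and append every
    FASTA file found to out_handle (opened in binary mode)."""
    for path in ftp.nlst(remote_dir):
        # Some servers return bare names, others full paths; normalise both.
        name = posixpath.basename(path)
        full = posixpath.join(remote_dir, name)
        if is_fasta(name):
            ftp.retrbinary("RETR " + full, out_handle.write)
        elif depth > 0 and "." not in name:
            # Assumption: entries without a dot are subfolders.
            try:
                fetch_fasta(ftp, full, out_handle, depth - 1)
            except error_perm:
                pass  # not a directory, or not readable; skip it


# Usage (host and path are hypothetical placeholders):
# ftp = FTP("ftp.example.org")
# ftp.login()                      # anonymous login
# with open("all_bacteria.fna", "wb") as out:
#     fetch_fasta(ftp, "/genomes/Bacteria", out)
# ftp.quit()
```

Because nothing except the matching files is ever transferred, the disk footprint is just the size of the combined FASTA file.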
4 hours later: I think I may need to do something like this, though I am not sure exactly how to apply it: http://adina-howe.readthedocs.org/en/latest/ncbi/
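For reference, a minimal sketch of what an E-utilities approach could look like in Python. It assumes a list of nucleotide accessions is already in hand (the accession in the usage comment is only an example), and the helper names `efetch_url` and `append_fasta` are made up for this illustration. The `efetch.fcgi` endpoint and its `db`, `id`, `rettype`, and `retmode` parameters are the standard NCBI ones.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build an efetch URL that returns the given nucleotide records as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def append_fasta(accessions, out_path, batch=100):
    """Fetch accessions in batches and append the FASTA records to out_path."""
    with open(out_path, "ab") as out:
        for i in range(0, len(accessions), batch):
            with urlopen(efetch_url(accessions[i:i + batch])) as resp:
                out.write(resp.read())
            time.sleep(0.4)  # stay under NCBI's limit of 3 requests/second without an API key


# Usage (example accession only):
# append_fasta(["NC_000913"], "bacteria.fna")
```

Only sequence data is requested, so nothing but FASTA ever reaches the disk.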
2 days later: As it turns out, Ikuo Uchiyama, who curates the Microbial Genome Database for Comparative Analysis (MBGD) in Japan, has FASTA files (he calls them .dnaseq files) representing 2823 organisms. They are sufficiently similar to the gibVirus file that I should be able to alter them with a series of awk, Python, or Perl scripts to pass muster. http://mbgd.genome.ad.jp/htbin/view_arch.cgi
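A header-rewriting pass of that kind might look roughly like the Python sketch below. The header layouts of the .dnaseq files and of the gibVirus file are not reproduced here, so `example_header` is purely hypothetical and stands in for whatever mapping the real files need; only `rewrite_headers` is meant to be reusable.

```python
def rewrite_headers(lines, make_header):
    """Yield FASTA lines unchanged, except that each '>' header line
    is replaced by '>' + make_header(old_header_text)."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + make_header(line[1:].rstrip("\n")) + "\n"
        else:
            yield line


def example_header(old):
    """Hypothetical mapping: keep the first token as the ID and join
    the rest of the description to it with a '|'."""
    ident, _, desc = old.partition(" ")
    desc = desc.strip()
    return (ident + "|" + desc) if desc else ident


# Usage (file names are placeholders):
# with open("input.dnaseq") as src, open("output.fasta", "w") as dst:
#     dst.writelines(rewrite_headers(src, example_header))
```

The input is streamed line by line, so even a multi-gigabyte file is processed without loading it into memory.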