Question: Requesting help building a bacterial database file consistent with gibVirus.fa
Asked 4.3 years ago by bgold04 (United States):

The Genome Information Broker for Viruses (gibVirus) provides a FASTA file, gibVirus.fa, with over 18,000 full-length viral genomes. It is useful for viral detection and for locating integration boundaries with a Perl script developed under an old version of the MOSAIK assembler. I would like to build a similar FASTA file for all the bacteria at NCBI or Ensembl, but the FASTA files there all sit in subfolders alongside .gbk and many other files. I have looked at scripting wget or mget against those FTP sites, but I don't see how to retrieve just the .fna (or .fa or .fasta) files within the folders, and retrieving the whole collection looks like a giant undertaking. Are there ideas or code for doing this that are manageable and economical in terms of disk space? The two sources I have in mind are Ensembl Bacteria and the NCBI bacterial genomes site. A related question on Biostars is "Where Can I Download Nucleotide Sequences Of Bacterial Genes?" (I *did* look at NCBI E-utilities, but I haven't a clue how I would use them to do this.)
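For anyone landing here with the same problem, one possible approach is to walk the FTP listing and fetch only files whose names end in a FASTA suffix, appending them to a single output file so nothing else touches the disk. This is a minimal sketch, not a tested recipe for either site: it assumes the layout described above (a root folder whose entries are subfolders, one level deep), that the FASTA files are uncompressed, and the host and paths in the usage comment are placeholders. The function name `fetch_fasta_only` is made up for illustration.

```python
import posixpath
from ftplib import FTP, error_perm

# Suffixes mentioned in the question; extend as needed.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta")


def is_fasta(name):
    """Return True if the file name ends in one of the FASTA suffixes."""
    return name.lower().endswith(FASTA_SUFFIXES)


def fetch_fasta_only(ftp, root, out):
    """Append every FASTA file found one level below `root` to `out`.

    `ftp` is a connected ftplib.FTP object and `out` is a file opened in
    binary mode. Assumes `root` contains only subfolders (an assumption
    about the server layout, not something checked here).
    Returns the list of remote paths that were fetched.
    """
    fetched = []
    for entry in ftp.nlst(root):
        # Servers differ on whether NLST returns bare names or full paths,
        # so rebuild the path from the base name either way.
        folder = posixpath.join(root, posixpath.basename(entry))
        try:
            names = ftp.nlst(folder)
        except error_perm:
            continue  # some servers raise this for empty or unlistable folders
        for name in names:
            path = posixpath.join(folder, posixpath.basename(name))
            if is_fasta(path):
                # Stream the file straight into the combined output.
                ftp.retrbinary("RETR " + path, out.write)
                fetched.append(path)
    return fetched


# Usage (placeholder host and path, replace with the real ones):
#
#     ftp = FTP("ftp.example.org")
#     ftp.login()  # anonymous login
#     with open("bacteria_all.fa", "wb") as out:
#         fetch_fasta_only(ftp, "/path/to/bacteria", out)
#     ftp.quit()
```

Because the .gbk and other files are never requested, the space cost is just the combined FASTA file itself.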

4 hours later:  I am thinking I might need to do something like this, but I am not certain precisely how to do it:

2 days later:  As it turns out, Ikuo Uchiyama, who curates the Microbial Genome Database for Comparative Analysis in Japan, has FASTA files (he calls them .dnaseq files) representing 2823 organisms. They are similar enough to gibVirus.fa that I should be able to bring them into line with a series of awk, Python or Perl scripts.
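The kind of header clean-up described here could look roughly like the following in Python. Neither the .dnaseq header layout nor the gibVirus.fa header layout is given in this post, so `example_rule` is purely a hypothetical placeholder; the only real assumption is that header lines start with ">" as in any FASTA file. Both function names are made up for illustration.

```python
def rewrite_headers(lines, make_header):
    """Yield FASTA lines, passing each header (without '>') through make_header.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield ">" + make_header(line[1:].rstrip("\n")) + "\n"
        else:
            yield line


def example_rule(old_header):
    """Hypothetical rule: 'id free text' becomes 'id|free text'.

    Replace this with whatever mapping the target format actually needs.
    """
    ident, _, desc = old_header.partition(" ")
    desc = desc.strip()
    return (ident + "|" + desc) if desc else ident


# Usage (file names are placeholders):
#
#     with open("organism.dnaseq") as src, open("organism.fa", "w") as dst:
#         dst.writelines(rewrite_headers(src, example_rule))
```

Keeping the rule in its own function means the same loop can be reused once the real header formats have been compared.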
